synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites...

180
SIGMA: A Sparse and Irregular GE MM Accelerator with Flexible Interconnects for DNN Training ERIC QIN * , ANANDA SAMAJDAR * , HYOUKJUN KWON * , VINEET NADELLA * , SUDARSHAN SRINIVASAN # , DIPANKAR DAS # , BHARAT KAUL # , TUSHAR KRISHNA * http://synergy.ece.gatech.edu * GEORGIA TECH # INTEL 1

Transcript of synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites...

Page 1: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects

for DNN Training

ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#,

BHARAT KAUL#, TUSHAR KRISHNA*

http://synergy.ece.gatech.edu

* GEORGIA TECH# INTEL

1

Page 2: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

2

Page 3: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline

3

• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

Page 4: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Deep Learning Applications

Speech Recognition Recommender SystemsLanguage Understanding

4

Page 5: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Deep Learning Applications

Speech Recognition Recommender SystemsLanguage Understanding

5

What is the key computation for these Deep Learning applications?

Page 6: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Runtime breakdown on V100 GPU

6

77 % 64 %

Transformer (Language Understanding)

GNMT (Machine Translation)

Page 7: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Runtime breakdown on V100 GPU

7

77 % 64 %

Transformer (Language Understanding)

GNMT (Machine Translation)

Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.

Page 8: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

GEMMs in Deep Learning

M

K

K

N N

MForward Pass(Inference and Training)

Input Feature Map

Output Feature MapWeights

GEMM MNK Dimension

Representation

M dim: batch size

N dim: number of channels in the next layer

K dim: [H * W * C]

8Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.

Page 9: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

GEMMs in Deep Learning

Backward Pass(Training)

M

K

K

N N

MForward Pass(Inference and Training)

Input Feature Map

Output Feature MapWeights

N

M

K

NWeightsT Output Feature Error

M

K

Input Feature Error

GEMM MNK Dimension

Representation

M dim: batch size

N dim: number of channels in the next layer

K dim: [H * W * C]

9Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.

Page 10: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

GEMMs in Deep Learning

Backward Pass(Training)

M

K

K

N N

MForward Pass(Inference and Training)

Input Feature Map

Output Feature MapWeights

N

M

K

NWeightsT

Gradient Computation(Training)

Output Feature Error

N

M Output Feature Error

M

K

InputFeature

MapT

M

K

Input Feature Error

K

N

𝚫Weights

GEMM MNK Dimension

Representation

M dim: batch size

N dim: number of channels in the next layer

K dim: [H * W * C]

Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al. 10

Page 11: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

GEMMs in Deep Learning

Backward Pass(Training)

M

K

K

N N

MForward Pass(Inference and Training)

Input Feature Map

Output Feature MapWeights

N

M

K

NWeightsT

Gradient Computation(Training)

Output Feature Error

N

M Output Feature Error

M

K

InputFeature

MapT

M

K

Input Feature Error

K

N

𝚫Weights

GEMM MNK Dimension

Representation

M dim: batch size

N dim: number of channels in the next layer

K dim: [H * W * C]

Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al. 11

GEMM is a key compute primitive to accelerate in hardware to speed up training.

Page 12: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Hardware for Accelerating GEMMs

12

SIMT Architectures

Nvidia GTX GPUs

Page 13: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

13

SIMD ArchitecturesSIMT Architectures

Microsoft Brainwave ARM Trillium

Tesla FSDC

Nvidia GTX GPUs

Hardware for Accelerating GEMMs

Page 14: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

14

SIMD ArchitecturesSIMT Architectures Systolic Architectures

Microsoft Brainwave ARM Trillium

Tesla FSDC

Nvidia GTX GPUs

Google TPU

Xilinx xDNN Nvidia Tensor Cores

Hardware for Accelerating GEMMs

Page 15: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

15

SIMD ArchitecturesSIMT Architectures Systolic Architectures

Recently, systolic array based architectures are popular for accelerating GEMMs.

Microsoft Brainwave ARM Trillium

Tesla FSDC

Nvidia GTX GPUs

Google TPU

Xilinx xDNN Nvidia Tensor Cores

Hardware for Accelerating GEMMs

Page 16: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Target comparison: Google TPU

TPU figure from https://cloud.google.com/tpu/docs/system-architecture

Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.

16

Page 17: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

17

Page 18: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Systolic Array Architectures

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127

18

Page 19: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array) MAC PE

Systolic Array Architectures

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127

19

+✕

Weight Buffer

Top input

Streaming input

Streaming output

Load (0) / Accum. (1)

Load (0) / Accum. (1)

0

1

Bottom output

Page 20: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127

20

Workload Application Example DimensionsM N K

GNMT Machine Translation

128 2048 4096320 3072 40961632 36548 10242048 4096 32

DeepBench General Workload

1024 16 50000035 8457 2560

Transformer Language Understanding

31999 1024 8484 1024 4096

NCF Collaborative Filtering

2048 1 128256 256 2048

GEMMs used for evaluation.

Page 21: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Workload Application Example DimensionsM N K

GNMT Machine Translation

128 2048 4096320 3072 40961632 36548 10242048 4096 32

DeepBench General Workload

1024 16 50000035 8457 2560

Transformer Language Understanding

31999 1024 8484 1024 4096

NCF Collaborative Filtering

2048 1 128256 256 2048

TPU (Systolic Array)

Mapping GEMMs onto TPUs

21

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127

Let’s map this GEMM!

GEMMs used for evaluation.

Page 22: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

StationarymatrixK

= 2

048

N = 256

22

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Page 23: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

StationarymatrixK

= 2

048

N = 256

23

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Page 24: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

StationarymatrixK

= 2

048

N = 256

24

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Page 25: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

StationarymatrixK

= 2

048

N = 256

25

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Page 26: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

26

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

Page 27: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

27

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

Page 28: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

28

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down each column.

Page 29: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

29

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down each column.

Page 30: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

30

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down each column.

Page 31: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

31

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

Reduce partial sum down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down each column.

Page 32: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs

32

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

Reduce partial sum down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

The streaming elements get multiplied with the stationary elements. The partial sums

get accumulated down a column.

Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.

Page 33: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Workload Application Example DimensionsM N K

GNMT Machine Translation

128 2048 4096320 3072 40961632 36548 10242048 4096 32

DeepBench General Workload

1024 16 50000035 8457 2560

Transformer Language Understanding

31999 1024 8484 1024 4096

NCF Collaborative Filtering

2048 1 128256 256 2048

Mapping GEMMs onto TPUs - Irregularity

33

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Let’s map another GEMM!** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

GEMMs used for evaluation.

Page 34: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

34

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

K = 32

N = 4096

K =

32

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 35: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

35

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

K = 32

N = 4096

K =

32

128

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 36: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

36

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

75% of the PEs are not utilized for this GEMM.

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

K = 32

N = 4096

K =

32

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 37: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

37

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

K = 32

N = 4096

K =

32

128

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

75% of the PEs are not utilized for this GEMM.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 38: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

38

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

75% of the PEs are not utilized for this GEMM.

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 39: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

39

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

75% of the PEs are not utilized for this GEMM.

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 40: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

40

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

75% of the PEs are not utilized for this GEMM.

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 41: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Mapping GEMMs onto TPUs

41

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

75% of the PEs are not utilized for this GEMM.

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

Page 42: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Mapping GEMMs onto TPUs

42

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

75% of the PEs are not utilized for this GEMM.

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Mapping GEMMs onto TPUs - Irregularity

The rigid structure of Systolic Arrays cause PE underutilization. How can we enable the remaining PEs?

Page 43: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

43

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

Page 44: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

44

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 45: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

45

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 46: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

46

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

duplicate streaming data

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 47: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

47

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

duplicate streaming data

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 48: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

48

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 49: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

49

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Can we map another section of the stationary matrix onto the TPU to increase utilization?

128

Mapping GEMMs onto TPUs - Irregularity

Page 50: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

50

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

This is incorrect functionally, because the systolic reduction will accumulate dark blue and light purple partial product clusters together. This is due to a rigid aspect ratio of a systolic array. Can we map another section of

the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

128

Page 51: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

51

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

No, systolic reduction will accumulate dark blue and light purple partial product clusters together, which is incorrect functionally. This is due to a rigid aspect ratio of a systolic array.

Can we map another section of the stationary matrix onto the TPU to increase utilization?

Mapping GEMMs onto TPUs - Irregularity

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both

sparse and irregular.

Page 52: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Sparsity in DNN Models

GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)

52

Page 53: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Sparsity in DNN Models

GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)

Transformer Sparsity - Impact on BLEU(The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)

53

Page 54: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Sparsity in DNN Models

GNMT Pruning - Temporal Sparsity(https://www.intel.ai/compressing-gnmt-models)

Transformer Sparsity - Impact on BLEU(The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)

54

Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately

30% to 70% from ReLU, dropout, etc.

Page 55: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Workload Application Example DimensionsM N K

GNMT Machine Translation

128 2048 4096320 3072 40961632 36548 10242048 4096 32

DeepBench General Workload

1024 16 50000035 8457 2560

Transformer Language Understanding

31999 1024 8484 1024 4096

NCF Collaborative Filtering

2048 1 128256 256 2048

Mapping GEMMs onto TPUs - Sparsity

55

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Usually these GEMMs are sparse!

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

GEMMs used for evaluation.

Page 56: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

56

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

What happens if half of the stationary matrix are zeros?

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950

Mapping GEMMs onto TPUs - Sparsity

Page 57: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

57

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

What happens if half of the stationary matrix are zeros?

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Mapping GEMMs onto TPUs - Sparsity

Page 58: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

58

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

What happens if half of the stationary matrix are zeros?

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Mapping GEMMs onto TPUs - Sparsity

Page 59: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

59

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

What happens if half of the stationary matrix are zeros?

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Multiplication with an operand that is zero is considered underutilized.

Mapping GEMMs onto TPUs - Sparsity

Page 60: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

60

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

87.5% of the PEs are not utilized for this GEMM.

Multiplication with an operand that is zero is considered underutilized.

Mapping GEMMs onto TPUs - Sparsity

Page 61: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

61

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

87.5% of the PEs are not utilized for this GEMM.

Multiplication with an operand that is zero is considered underutilized.

Mapping GEMMs onto TPUs - Sparsity

Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How

can we map only nonzeros stationary?

Page 62: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

62

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

Page 63: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

63

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

128

Page 64: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

64

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

128

Page 65: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

65

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

128

Page 66: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

66

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

128

Page 67: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

67

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Dark blue and light purple clusters accumulate, which is incorrect. Systolic reduction only allows fixed size dot products, which is the size of a column.

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

128

Page 68: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

68

Stationary matrix

M =

204

8

Stre

amin

g m

atri

x

N = 4096

K =

32

128

K = 32

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

TPU (Systolic Array)

15

31

47

63

79

95

111

127

15 31 47 63 79 111 127950M = 2048

Reduce partial sum down each column.

Dark blue and light purple clusters accumulate, which is incorrect. Systolic reduction only allows fixed size dot products, which is the size of a column.

Can we map other nonzero elements where the idle PEs used to be?

Mapping GEMMs onto TPUs - Sparsity

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have large load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both

sparse and irregular.

Page 69: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

Mapping GEMMs onto TPUs - Scalability

StationarymatrixK

= 2

048

N = 256

69

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Page 70: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

StationarymatrixK

= 2

048

N = 256

70

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Cycle 1

Mapping GEMMs onto TPUs - Scalability

Page 71: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

StationarymatrixK

= 2

048

N = 256

71

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Cycle 2

Mapping GEMMs onto TPUs - Scalability

Page 72: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

StationarymatrixK

= 2

048

N = 256

72

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Cycle 3

Mapping GEMMs onto TPUs - Scalability

Page 73: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

StationarymatrixK

= 2

048

N = 256

73

K = 2048

M =

256

Streaming matrix

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127 128

128

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

Cycle 128Loading takes 128 cycles, which

scales at O(sqrtN).

Mapping GEMMs onto TPUs - Scalability

Page 74: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

74

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

Reduce partial sum down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

Reduction will take 128 cycles for the last accumulate to finish before loading a new

portion of the stationary matrix.

Mapping GEMMs onto TPUs - Scalability

Page 75: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

75

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

Reduce partial sum down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

Reduction will take 128 cycles for the last accumulate to finish before loading a new

portion of the stationary matrix.

Mapping GEMMs onto TPUs - Scalability

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both

sparse and irregular.

Page 76: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

TPU (Systolic Array)

76

0

1

2

3

4

.

.

127

0 1 2 3 4 . . 127M = 256

Reduce partial sum down each column.

** Assuming MK matrix is streaming and KN matrix is stationary. (aka weight stationary)

StationarymatrixK

= 2

048

N = 256

K = 2048

M =

256

Streaming matrix

128

128

128

Reduction will take 128 cycles for the last accumulate to finish before loading a new

portion of the stationary matrix.

Mapping GEMMs onto TPUs - Scalability

Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.

Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.

Observation 3: Large systolic arrays have significant load and reduction latency.

Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are

both sparse and irregular.

Page 77: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

77

Page 78: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Key Accelerator Requirements

Requirement Systolic Array Limitation SIGMA Desired Traits

Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio

Sparsity Support ● Data forwarding every cycle in horizontal / vertical direction requires systolic array to map zeros.

● Support sparsity by only mapping nonzeros

● ability to create simultaneous variable sized dot product

Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction

● O(1) distribution● O(log2N) reduction

78

Page 79: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Key Accelerator Requirements

Requirement Systolic Array Limitation SIGMA Desired Traits

Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio

Sparsity Support ● data forwarding every cycle requires systolic array to map zeros

● sparsity support by mapping only nonzeros stationary

● ability to create simultaneous variable sized dot product

Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction

● O(1) distribution● O(log2N) reduction

79

Page 80: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Key Accelerator Requirements

Requirement Systolic Array Limitation SIGMA Desired Traits

Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio

Sparsity Support ● data forwarding every cycle requires systolic array to map zeros

● sparsity support by mapping only nonzeros stationary

● ability to create simultaneous variable sized dot product

Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction

● O(1) distribution● O(logN) reduction

80

Page 81: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Key Accelerator Requirements

Requirement Systolic Array Limitation SIGMA Desired Traits

Flexibility ● rigid aspect ratio ● ability to mimic any 2D aspect ratio

Sparsity Support ● data forwarding every cycle requires systolic array to map zeros

● sparsity support by mapping only nonzeros

● ability to create simultaneous variable sized dot product

Scalability ● O(sqrtN) distribution ● O(sqrtN) reduction

● O(1) distribution● O(logN) reduction

81

With flexible and scalable interconnects between all PEs, SIGMA can solve the three requirements.

Page 82: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic vs SIGMA High Level Interconnects

A B C D

I J K L

4 x 4 Systolic Array

82

A B C D

I J K L

E F G H

M N O P

16 PE SIGMA

● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and

reduction

Page 83: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic vs SIGMA High Level Interconnects

83

16 PE SIGMA

A B C D

I J K L

E F G H

Distribution Network

M N O P

** Microarchitecture details on the networks will be discussed later

A B C D

I J K L

4 x 4 Systolic Array

● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and

reduction

Page 84: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic vs SIGMA High Level Interconnects

84

16 PE SIGMA

A B C D

I J K L

E F G H

Distribution Network

M N O P

** Microarchitecture details on the networks will be discussed later

A B C D

I J K L

4 x 4 Systolic Array

● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs

● Ability to send any streaming element to any PE

● O(1) distribution

● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and

reduction

Page 85: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic vs SIGMA High Level Interconnects

85

** Microarchitecture details on the networks will be discussed later

A B C D

I J K L

4 x 4 Systolic Array 16 PE SIGMA

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs

● Ability to send any streaming element to any PE

● O(1) distribution

● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and

reduction

Page 86: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic vs SIGMA High Level Interconnects

86

** Microarchitecture details on the networks will be discussed later

A B C D

I J K L

4 x 4 Systolic Array 16 PE SIGMA

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

● Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs

● Ability to send any streaming element to any PE

● O(1) distribution

● Reduction network allows SIGMA to reduce variable sized dot products

● Addresses sparsity and irregularity

● O(logN) reduction

● rigid aspect ratio● fixed size dot product● O(sqrtN) distribution and

reduction

Page 87: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

4 x 4 Systolic Array

a

b

c

d

e

f

g

h

87

Irregular GEMMs on SIGMA

Set up the stationary values.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 88: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

c

d

a

b

c

d

e

f

g

h

e

f

g

h

88

16 PE SIGMA

Irregular GEMMs on SIGMA

Set up the stationary values.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 89: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

c

d

a

b

c

d

e

f

g

h

e

f

g

h

a

b

a

b

a

b

a

b

a

b

a

b

a

b

89

16 PE SIGMA

Irregular GEMMs on SIGMA

Next cycle: Multicast first row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 90: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

c

d

c

d

e

f

g

h

e

f

g

h

c

d

c

d

c

d

c

d

d

c

c

d

c

d

90

16 PE SIGMA

Irregular GEMMs on SIGMA

Next cycle: Multicast second row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 91: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

c

d

e

f

e

f

g

h

g

h

e

f

e

f

e

f

e

f

f

e

e

f

e

f

91

16 PE SIGMA

Irregular GEMMs on SIGMA

Next cycle: Multicast third row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 92: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

I J K L

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

c

d

g

h

e

f

g

h

g

h

g

h

g

h

g

h

h

g

g

h

g

h

92

16 PE SIGMA

Irregular GEMMs on SIGMA

Next cycle: Multicast fourth row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 93: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

E F G H

M N O P

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

g

h

g

h

g

h

g

h

g

h

h

g

g

h

g

h

a

b

c

d

e

f

g

h

93

16 PE SIGMA

Irregular GEMMs on SIGMA

After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again (referred to as folding).

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 94: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

E F G H

M N O P

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

g

h

g

h

g

h

g

h

g

h

h

g

g

h

g

h

a

b

c

d

e

f

g

h

94

16 PE SIGMA

Irregular GEMMs on SIGMA

After accumulation, SIGMA is done. However, the systolic array has to map the other side of the stationary matrix and stream in the MK matrix again.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

SIGMA reduces the number of folds, which then reduces the number of memory references on the streaming matrix.

Page 95: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

E F G H

M N O P

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

95

16 PE SIGMA

Irregular GEMMs on SIGMA

Final cycle count.

Systolic Array total runtime:

24 cycles

SIGMA total runtime:

13 cycles

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 96: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

E F G H

M N O P

A B C D

I J K L

E F G H

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

96

16 PE SIGMA

Irregular GEMMs on SIGMA

Final cycle count.

Systolic Array total runtime:

24 cycles

SIGMA total runtime:

13 cycles

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.

Page 97: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Sparse Irregular GEMMs on SIGMA

A B C D

H I J

4 x 4 Systolic Array

a

b

cd

f

g

e

97

Set up the stationary values.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 98: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

H I J

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

cd

f

g

e

a

b

cd

f

g

e

98

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Set up the stationary values.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 99: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

H I J

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

a

b

cd

f

g

e

cd

f

g

e

a

b

a

b

b

b

a

b

a

b

a

b

a

b

99

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Next cycle: Multicast first row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 100: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

H I J

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

ccd

f

g

e

d

f

g

e

c c

c c c c

100

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Next cycle: Multicast second row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 101: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

H I J

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b

dcd

f

g

e

g

e

d d

d d d d

101

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Next cycle: Multicast third row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 102: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

A B C D

H I J

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

a

b e

cd

f

g

e e e

e

e

e e e e

102

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Next cycle: Multicast fourth row of MK to the corresponding stationary elements.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 103: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

C D E

K L M N

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

4 x 4 Systolic Array

e e e

e

e

e e e e

a

b

cd

f

g

e

103

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

After accumulation, SIGMA is done. However, the systolic array has to map the other part of the stationary matrix and stream in the MK matrix again (referred to as folding).

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 104: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

F G

O P

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

16 PE SIGMA Flex-DPE4 x 4 Systolic Array

e e e

e

e

e e e e

a

b

cd

f

g

e

104

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Again, the systolic array has to map another part of the stationary matrix and stream MK again.

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 105: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

F G

O P

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

16 PE SIGMA Flex-DPE4 x 4 Systolic Array

105

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Final cycle count.

Systolic Array total runtime:

34 cycles

SIGMA total runtime:

13 cycles

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

Page 106: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

F G

O P

A B J C

H I L K

D E F G

Distribution Network

Reduction NetworkM N O P

16 PE SIGMA Flex-DPE4 x 4 Systolic Array

106

16 PE SIGMA

Sparse Irregular GEMMs on SIGMA

Final cycle count.

Systolic Array total runtime:

34 cycles

SIGMA total runtime:

13 cycles

** Assuming MK matrix is

streaming and KN matrix is

stationary. (aka weight

stationary)

SIGMA maps only nonzeros stationary; therefore, reduces the number of folds needed.

Page 107: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU (Systolic Array)

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

107

Page 108: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

O(1) Distribution TopologyUnicast

(Loading Stationary Matrix)

Crossbar

X X X XX X X X X X X X X X X X

Benes Crossbar Benes

108

Source

Destination Destination Destination Destination

Source Source Source

Multicast (Sending Streaming Matrix)

Page 109: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

O(1) Distribution Topology

Crossbar Benes

109

Source Source

Crossbar BenesSource Source

X X X XX X X X X X X X X X X X

Destination Destination Destination Destination

Crossbar Benes Crossbar Benes

Unicast (Loading Stationary Matrix)

Multicast (Sending Streaming Matrix)

Page 110: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

O(1) Distribution Topology

Crossbar Benes

110

Source Source

Crossbar BenesSource Source

X X X XX X X X X X X X X X X X

Destination Destination Destination Destination

Crossbar Benes Crossbar Benes

Unicast (Loading Stationary Matrix)

Multicast (Sending Streaming Matrix)

Page 111: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

O(1) Distribution Topology

Crossbar Benes

Unicast Multicast

111

Source Source

Crossbar BenesSource Source

X X X XX X X X X X X X X X X X

Destination Destination Destination Destination

SIGMA’s distribution can be either a Crossbar or Benes network. We chose Benes because the number of

switches scale by O(N logN).

Page 112: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

O(logN) Reduction (Limitation of Adder Tree)

Log2(N) reduction time 112

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 17 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15 2-input BF16 Mult Switch2-input BF32 Adder Switch

11

25 13 21

Different clusters of partial sums.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

Page 113: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 113

1 5 9 17 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15 2-input BF16 Mult Switch2-input BF32 Adder Switch

11

25 13 21

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

Different clusters of partial sums.

Regular adder tree will accumulate a + b, which is incorrect functionality.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

O(logN) Reduction (Limitation of Adder Tree)

Page 114: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 17 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Forwarding Adder Network (FAN)

114

Different clusters of partial sums.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

Page 115: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 17 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Forwarding Adder Network (FAN)

115

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

Page 116: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 17 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Pipelined at 1 cycle per adder level.

Forwarding Adder Network (FAN)

116

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

Page 117: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Cycle 1: Bypass conflicting partial sums

117

Forwarding Adder Network (FAN)

Bypass adder and forward partial sum to next level!

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

Page 118: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Cycle 2: Partial sum red complete

118

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

sent to output buffer

Forwarding Adder Network (FAN)

Page 119: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 1 21

119

Forwarding Adder Network (FAN)

Cycle 3: Partial sums green and orange complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

sent to output buffer sent to output buffer

Page 120: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

Cycle 4: Partial sum blue complete

120

Forwarding Adder Network (FAN)

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

sent to output buffer

Page 121: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

121

Forwarding Adder Network (FAN)

Cycle 5: Partial sum purple complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

sent to output buffer

Page 122: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

122

Forwarding Adder Network (FAN)

Cycle 5: Partial sum purple complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

+

x x xx+

+

+x x xx

+

+

+

+

x x xx+

+

+xx

+

+

+

+

+ +2-input BF16 Mult Switch2-input BF32 Adder Switch

x

x x+

send to output buffer

Note: The output buffer has FFs to maintain correct timing since different clusters may complete at different time.

Page 123: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

123

Forwarding Adder Network (FAN)

Cycle 4: Partial sum purple complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

Our novel FAN topology is both lightweight and flexible.

It can easily replace any other accelerators that use adder trees.

More details such as the routing algorithm and overhead analysis can be found in the paper.

Page 124: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

124

Forwarding Adder Network (FAN)

Cycle 4: Partial sum purple complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

Our novel FAN topology is both lightweight and flexible.

It can replace regular adder trees in other hardware accelerators.

More details such as the routing algorithm and overhead analysis can be found in the paper.

Page 125: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

1 5 9 13 17 21 25 29

19 27

23

0 1 2 3 4 5 6 7 8 9 10 1 1 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

a a a b b b b b b c c c c d d d d d d d d d d d e e e e e e e e

3

7

15

N-to-2 Mux2-input BF16 Mult Switch2-input BF32 Adder SwitchForwarding FF

11

25 13 21

125

Forwarding Adder Network (FAN)

Cycle 4: Partial sum purple complete

Different clusters of partial sums.

FAN is optimized for floating point reductions, commonly used during DNN training.

Our novel FAN topology is both lightweight and flexible.

It can replace regular adder trees in other hardware accelerators.

More details such as the routing algorithm and overhead analysis can be found in the paper.

Page 126: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

126

Page 127: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 127

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Page 128: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 128

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM

matrices.

Page 129: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 129

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM

matrices.

Global Controller● Logic comparisons on bitmaps to determine

what nonzero stationary elements are required

Page 130: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 130

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM

matrices.

Global Controller● Logic comparisons on bitmaps to determine

what nonzero stationary elements are required

Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary

elements and sending streaming elements

Page 131: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 131

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM

matrices.

Global Controller● Logic comparisons on bitmaps to determine

what nonzero stationary elements are required

Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary

elements and sending streaming elements

Accumulation SRAM ● Buffer for partial sum accumulations

Page 132: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 132

SIGMA High Level Diagram

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

Data and Bitmap SRAM Banks● Contains bitmap compression format of GEMM

matrices.

Global Controller● Logic comparisons on bitmaps to determine

what nonzero stationary elements are required

Sparsity Filter && Input Data Arbiter● Reorganizes data for loading stationary

elements and sending streaming elements

Accumulation SRAM ● Buffer for partial sum accumulations

SIGMA Engine● Compute engine

Page 133: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Log2(N) reduction time 133

SIGMA High Level Diagram

X

x x x x

- - - -

+ +

+

- - - -

X

x x x x

- - - -

+ +

+

- - - -

SIGMA Flex-DPE 0

Distribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

SIGMAFlex-DPE N

Note: SIGMA Engine contains multiple SIGMA units called Flex-DPEs.

+

SIGMA Engine● Compute engine

Page 134: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

134

Page 135: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Methodology

135

● Hardware components are written in Verilog

● Post layout area and power numbers are on a 28nm process

● Analytical model for cycle counts assumes uniform random sparsity

Workload Application Example DimensionsM N K

GNMT Machine Translation

128 2048 4096320 3072 40961632 36548 10242048 4096 32

DeepBench General Workload

1024 16 50000035 8457 2560

Transformer Language Understanding

31999 1024 8484 1024 4096

NCF Collaborative Filtering

2048 1 128256 256 2048

GEMMs used for evaluation.

Page 136: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Dense GEMMs

136

Page 137: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Dense GEMMs

For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers slight speedup from its O(1) distribution and O(logN) reduction.

137

Page 138: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Dense GEMMs

SIGMA offers 8x speedup for this GEMM. This is because the N dimension of 16 cannot fit perfectly on TPU. Also, O(logN) reduction is more effective since K = 500000.

138

Page 139: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Dense GEMMs

SIGMA offers large speedup from its scalable FAN reduction (more effective since K = 500000). Additionally, this GEMM cannot fit perfectly on TPU because of the N dimension.

139

SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.

Page 140: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Sparse GEMMs

140

Page 141: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Sparse GEMMs

141

SIGMA enables speedup by only mapping nonzero elements stationary.

Page 142: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs TPU - Sparse GEMMs

142

SIGMA enables speedup by only mapping nonzero elements stationary.

SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.

Page 143: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA vs Sparse Accelerators

143

Page 144: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA Qualitative Analysis

144

Page 145: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

SIGMA Qualitative Analysis

145

SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In depth analysis can

be found in the paper.

Page 146: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic Array vs SIGMA Comparison

146

** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

Page 147: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic Array vs SIGMA Comparison

147

** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

SIGMA consumes 38% more area and 82% more power than Systolic Array.

Page 148: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic Array vs SIGMA Comparison

148

** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.

Page 149: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Systolic Array vs SIGMA Comparison

149

** Effective TFLOPs is calculated by multiplying the base dense TFLOPs with the average efficiency computed across GEMMs.

SIGMA achieves 5.7x higher effective TFLOPS for a 3.2x higher effective TFLOPS/W.

SIGMA consumes more resources but achieves higher effective TFLOPS/W.

Page 150: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Outline• Motivation

○ GEMMs in Deep Learning

○ Utilization on TPU

• Accelerator Requirements

• SIGMA

○ Interconnects Implementations

○ Full System Design

• Evaluation

• Conclusion

150

Page 151: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

151

Page 152: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

● High utilization from systolic arrays is challenging because of their rigid structure.

152

Page 153: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

● High utilization from systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse irregular GEMMs.

153

Page 154: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

● High utilization from systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse irregular GEMMs.

● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse

accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN

reduction interconnects.

154

Page 155: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

● High utilization from systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse irregular GEMMs.

● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse

accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN

reduction interconnects.

● SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.

155

Page 156: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Conclusion● GEMM is a key component of Deep Learning workloads, but they are often irregular and sparse.

● High utilization from systolic arrays is challenging because of their rigid structure.

● SIGMA enables high compute utilization on sparse irregular GEMMs.

● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse

accelerators at the cost of extra hardware, specifically for the O(1) distribution and the novel FAN

reduction interconnects.

● SIGMA achieves 3.2x higher Effective TFLOPS/W than Systolic Arrays.

● Future work: Optimizations such as power gating and software stack design.

Thank you! 😊

156

Page 157: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Material Backup

157

Page 158: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

0 1 1 1

1 1 1 0

1 1 1 0

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Dim. Registers

M-dim 3

N-dim 4

K-dim 4

Compressed bitmap equivalent

Stationary Bitmap

Streaming Bitmap

A B C D E F G H I

Stationary Nonzero data

a b c d e f g

Streaming Nonzero data

158

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 159: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

159

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 160: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

160

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 161: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

0 1 1 1

1 1 1 0

1 1 1 0

Stationary Bitmap

161

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 162: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

0 1 1 1

1 1 1 0

1 1 1 0

Stationary Bitmap

162

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 163: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

0 1 1 1

1 1 1 0

1 1 1 0

Stationary Bitmap

0 1 1 1

1 1 1 0

1 1 1 0

Stationary’ Bitmap

163

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 164: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

0 1 1 1

1 1 1 0

1 1 1 0

Stationary Bitmap

A B C D E F G H I

0 1 1 0

1 1 1 0

1 1 1 0

Stationary’ Bitmap

164

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 165: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

1

1

1

0

0 1 1 1

1 1 1 0

1 1 1 0

Stationary Bitmap

A B C D E F G H I

0 1 1 0

1 1 1 0

1 1 1 0

Stationary’ Bitmap

165

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

The main idea is to find the necessary nonzero elements to map stationary.

Page 166: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

x x x x

A B D E

+ +

+

- - - -

X

x x x x

- - - -

+ +

+

- - - -

Flex-DPE 0 Flex-DPE 1

Distribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

166

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Cycle 1: unicast elements stationary to Flex-DPE 0

+

Page 167: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E- - - -

X

- - - -

Flex-DPE 0 Flex-DPE 1

Distribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

167

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Cycle 2: unicast elements stationary to Flex-DPE 1

x x x x

+ +

+

x x x x

+ +

+

+

Page 168: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

0 1 1 0

1 1 1 0

1 1 1 0

Stationary’ Bitmap

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

0 1

2 3 0

1 2 3

1 ... ... ...

1 ... ... ...

1 ... ... ...

Output BitmapRow ID0

1

2

0

1

168

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Page 169: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

0 1 1 0

1 1 1 0

1 1 1 0

Stationary’ Bitmap

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

0 1

2 3 0

1 2 3

1 ... ... ...

1 ... ... ...

1 ... ... ...

Output Bitmap0

1

2

0

1

169

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Row ID

Page 170: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

0 1 1 0

1 1 1 0

1 1 1 0

Stationary’ Bitmap

1 1 0 1

0 1 1 0

1 0 0 1

0 0 0 0

Streaming Bitmap

0 1

2 3 0

1 2 3

1 ... ... ...

1 ... ... ...

1 ... ... ...

Output Bitmap0

1

2

0

1

170

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Row ID

The main idea is to find the matching source and destination indices.

Page 171: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

a f - - a f - -

171

Cycle 3: multicast 1st column of streaming matrix and reduce

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

- f a - f a - f

b d

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

x x x x

+ +

+

x x x x

+ +

+

+

Page 172: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

a f - - a f - -

172

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

- f a - f a - f

b d

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

x x x x

+ +

+

x x x x

+ +

+

+

Cycle 3: multicast 1st column of streaming matrix and reduce

Page 173: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

b d - - b d - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

173

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

e

d - b d - b d -

Bf Da-- -- Ga --Ff Ifx x x x

+ +

+

x x x x

+ +

+

+

Cycle 4: multicast 2nd column of streaming matrix and reduce

Page 174: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I

e - - - e - - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

174

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

e - - e - - e -

c g

-- DbAd Ed Gb Hd-- --

Bf Ga If

x x x x

+ +

+

x x x x

+ +

+

+

Da Ff

Cycle 5: multicast 3rd column of streaming matrix and reduce

Page 175: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I - g c - g c - g

c g - - c g - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

175

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

-- -- Ee -- He-- --

Ga + If

Db + Ed

x x x x

+ +

+

x x x x

+ +

+

+

Ae

Ad --

Da FfGb Hd

Cycle 6: multicast last column of streaming matrix and reduce

Page 176: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I - - - - - - - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

176

Cycle 7: FAN Reduction

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

-- Dc -- Gc --Fg Ig

Ee

x x x x

+ +

+

x x x x

+ +

+

+

Bg--

Ae --

Da + Ff

Db + Ed -- Gb + Hd

-- He

Page 177: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I - - - - - - - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

177

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

-- -- -- -- ---- --

Db + Ed

x x x x

+ +

+

x x x x

+ +

+

+

----

Bg Dc Fg

Ee -- He

Gc Ig

Cycle 8: FAN Reduction

Page 178: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I - - - - - - - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

178

Cycle 9: FAN Reduction

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Ee

x x x x

+ +

+

x x x x

+ +

+

+

Dc Fg Gc + Ig

Page 179: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

Walkthrough

X

A B D E

XDistribution busSwitch

Distribution Network - Benes

Buffers

Multipliers

Reduction Network - FAN

F G H I - - - - - - - -

SIGMA Flex-DPE 0

SIGMA Flex-DPE 1

179

Cycle 10: FAN Reduction

i) Get bitmap from memoryii) Row wise OR on streaming matrixiii) Element wise AND on stationary matrixiv) Generate stationary’ bitmapvi) Unicast stationary valuesvii) Get source - destination pairsviii) Multicast streaming values and reduce

Walkthrough Section

Example GEMM

Dc + Fg

x x x x

+ +

+

x x x x

+ +

+

+

Page 180: synergy.ece.gatech.edu SIGMA: A Sparse and Irregular GEMM … › wp-content › uploads › sites › ... · 2020-03-02 · 2048 1 128 256 256 2048 Mapping GEMMs onto TPUs - Irregularity

180