Systolic Architecture


Page 1: Systolic Architecture

Systolic Architecture

• Conventional architectures operate on load and store operations from memory.

• This requires more memory references, which slows down the system.

Page 2: Systolic Architecture

Systolic Architecture

• In systolic processing, the data to be processed flows through various operation stages and is finally put in memory.

Page 3: Systolic Architecture

Systolic Architecture

• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.

• Each PE may have some registers and an ALU.

• PEs are interlinked in a manner dictated by the requirements of the specific algorithm.

• E.g. 2D mesh, hexagonal arrays, etc.

Page 4: Systolic Architecture

Systolic Architecture

• PEs at the boundary of the structure are connected to memory.

• Data picked up from memory is circulated in a rhythmic manner among the PEs that require it, and the result is fed back to memory; hence the name "systolic".

• Example: Multiplication of two n x n matrices

Page 5: Systolic Architecture

Example: Multiplication of two n x n matrices

• In the conventional method, every input element is picked up n times from memory, as it contributes to n elements of the output.

• To reduce these memory accesses, the systolic architecture ensures that each element is pulled from memory only once.

• Consider an example where n = 3
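The saving in memory references can be made concrete with a rough count. The sketch below assumes a simplified model that is not spelled out on the slides: only operand fetches are counted (no caches, no writes of the result), and the function name `operand_reads` is purely illustrative.

```python
def operand_reads(n):
    """Count operand fetches for an n x n matrix product (simplified model)."""
    # Conventional method: each of the n^3 multiplications fetches
    # one element of A and one element of B from memory.
    conventional = 2 * n ** 3
    # Systolic ideal from the slide: each of the 2 * n^2 input
    # elements is pulled from memory exactly once.
    systolic = 2 * n ** 2
    return conventional, systolic


conventional, systolic = operand_reads(3)
print(conventional, systolic)  # 54 18
```

Under this model the ratio between the two counts is n, matching the statement that each element is otherwise picked up n times.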

Page 6: Systolic Architecture

Matrix Multiplication

| a11 a12 a13 |   | b11 b12 b13 |   | c11 c12 c13 |
| a21 a22 a23 | * | b21 b22 b23 | = | c21 c22 c23 |
| a31 a32 a33 |   | b31 b32 b33 |   | c31 c32 c33 |

Conventional Method: O(n^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
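As a runnable counterpart to the conventional method, here is a minimal Python sketch of the triple loop. Each output element C[i][j] is the dot product of row i of A with column j of B. The function name `matmul_naive` is an illustrative choice, and the sample matrix is the 3 x 3 one used in the worked example later in these slides.

```python
def matmul_naive(a, b):
    """Multiply two n x n matrices (lists of lists) with the O(n^3) triple loop."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # row of the result
        for j in range(n):      # column of the result
            for k in range(n):  # position along row i of A and column j of B
                c[i][j] += a[i][k] * b[k][j]
    return c


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(matmul_naive(A, A))  # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
```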

Page 7: Systolic Architecture

Systolic Method

This will run in O(n) time!

To run in O(n) time we need n x n processing units; in our example n = 3, so we need 9 units.

P9 P8 P7

P6 P5 P4

P1 P2 P3

Page 8: Systolic Architecture

For systolic processing, the input data need to be modified as:

a13 a12 a11
a23 a22 a21      (flip columns 1 & 3)
a33 a32 a31

b31 b32 b33
b21 b22 b23      (flip rows 1 & 3)
b11 b12 b13

and finally stagger the data sets for input.
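The flip-and-stagger step can be sketched in code. One reading of the slides, assumed here, is that the flip is a drawing convention that places the first element to enter nearest the array, so listing each stream in order of arrival needs no explicit flip. The sketch also assumes row i of A and column j of B are each delayed by i and j ticks (0-indexed), with `None` marking a tick on which nothing enters. The name `staggered_streams` is illustrative.

```python
def staggered_streams(a, b):
    """Return per-row streams of A and per-column streams of B, in arrival order.

    Entry t of a stream is the value entering the array at tick t + 1;
    None means no value enters on that tick.
    """
    n = len(a)
    # Row i of A: delayed by i ticks, then a[i][0], a[i][1], ..., a[i][n-1].
    row_streams = [[None] * i + list(a[i]) + [None] * (n - 1 - i)
                   for i in range(n)]
    # Column j of B: delayed by j ticks, then b[0][j], b[1][j], ..., b[n-1][j].
    col_streams = [[None] * j + [b[k][j] for k in range(n)] + [None] * (n - 1 - j)
                   for j in range(n)]
    return row_streams, col_streams


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
rows, cols = staggered_streams(A, A)
for stream in rows:
    print(stream)
# [3, 4, 2, None, None]
# [None, 2, 5, 3, None]
# [None, None, 3, 2, 5]
```

Every stream has length 2n - 1, since the last row or column starts n - 1 ticks late and then takes n ticks to enter.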

Page 9: Systolic Architecture

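At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied and the product is added to the result saved in the processor's register.

Row streams of A (rightmost element enters first):

a13 a12 a11
a23 a22 a21
a33 a32 a31

Column streams of B (rightmost element enters first):

b31 b21 b11
b32 b22 b12
b33 b23 b13

Processor array:

P9 P8 P7
P6 P5 P4
P1 P2 P3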
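The clocked behaviour described on this slide can be simulated in a few lines. This is a sketch under stated assumptions, not the slides' own implementation: each PE keeps its own accumulator, A operands move one PE along a row per tick, B operands move one PE along a column per tick, and boundary PEs read staggered inputs. The names `systolic_matmul`, `a_reg`, `b_reg` and `history` are illustrative. With 0-indexed grid position (i, j), accumulator `acc[i][j]` corresponds to processor P(3i + j + 1) in the 3 x 3 example, which is the mapping that matches the tick-by-tick values in the slides.

```python
def systolic_matmul(a, b):
    """Simulate an n x n systolic array; return (result, per-tick accumulators)."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]       # one accumulator register per PE
    a_reg = [[None] * n for _ in range(n)]  # A operand currently held by each PE
    b_reg = [[None] * n for _ in range(n)]  # B operand currently held by each PE
    history = []
    for t in range(3 * n - 2):              # ticks, 0-indexed
        # Shift A operands one PE along each row (highest index first,
        # so a value is read before it is overwritten).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Shift B operands one PE along each column.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Boundary PEs read the staggered inputs: row i and column j
        # start i and j ticks late, respectively.
        for i in range(n):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < n else None
        # Every PE that holds both operands multiplies and accumulates.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
        history.append([row[:] for row in acc])
    return acc, history


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
result, history = systolic_matmul(A, A)
print(result)      # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
print(history[1])  # after tick 2: [[17, 12, 0], [6, 0, 0], [0, 0, 0]]
```

Each input element is read from the streams once and then handed from PE to PE, which is the property that removes the repeated memory fetches of the conventional method.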

Page 10: Systolic Architecture

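| 3 4 2 |   | 3 4 2 |   | 23 36 28 |
| 2 5 3 | * | 2 5 3 | = | 25 39 34 |
| 3 2 5 |   | 3 2 5 |   | 28 32 37 |

Using a systolic array.

Row streams of A:

2 4 3
3 5 2
5 2 3

Column streams of B:

3 2 3
2 5 4
5 3 2

Processor array:

P9 P8 P7
P6 P5 P4
P1 P2 P3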

Page 11: Systolic Architecture

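Clock tick: 1

Accumulators:

P1 = 9    P2 = 0    P3 = 0
P4 = 0    P5 = 0    P6 = 0
P7 = 0    P8 = 0    P9 = 0

A values still to enter:

2 4
3 5 2
5 2 3

B values still to enter:

3 2
2 5 4
5 3 2

Processor array (products computed this tick):

P9   P8   P7
P6   P5   P4
3*3  P2   P3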

Page 12: Systolic Architecture

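Clock tick: 2

Accumulators:

P1 = 9 + 8 = 17    P2 = 12    P3 = 0
P4 = 6             P5 = 0     P6 = 0
P7 = 0             P8 = 0     P9 = 0

A values still to enter:

2
3 5
5 2 3

B values still to enter:

3
2 5
5 3 2

Processor array (products computed this tick):

P9   P8   P7
P6   P5   2*3
4*2  3*4  P3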

Page 13: Systolic Architecture

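Clock tick: 3

Accumulators:

P1 = 17 + 6 = 23    P2 = 12 + 20 = 32    P3 = 6
P4 = 6 + 10 = 16    P5 = 8               P6 = 0
P7 = 9              P8 = 0               P9 = 0

A values still to enter:

3
5 2

B values still to enter:

2
5 3

Processor array (products computed this tick):

P9   P8   3*3
P6   2*4  5*2
2*3  4*5  3*2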

Page 14: Systolic Architecture

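Clock tick: 4

Accumulators:

P1 = 23             P2 = 32 + 4 = 36    P3 = 6 + 12 = 18
P4 = 16 + 9 = 25    P5 = 8 + 25 = 33    P6 = 4
P7 = 9 + 4 = 13     P8 = 12             P9 = 0

A values still to enter:

5

B values still to enter:

5

Processor array (products computed this tick; finished results shown as numbers):

P9   3*4  2*2
2*2  5*5  3*3
23   2*2  4*3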

Page 15: Systolic Architecture

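Clock tick: 5

Accumulators:

P1 = 23              P2 = 36              P3 = 18 + 10 = 28
P4 = 25              P5 = 33 + 6 = 39     P6 = 4 + 15 = 19
P7 = 13 + 15 = 28    P8 = 12 + 10 = 22    P9 = 6

Processor array (products computed this tick; finished results shown as numbers):

3*2  2*5  5*3
5*3  3*2  25
23   36   2*5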

Page 16: Systolic Architecture

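Clock tick: 6

Accumulators:

P1 = 23    P2 = 36              P3 = 28
P4 = 25    P5 = 39              P6 = 19 + 15 = 34
P7 = 28    P8 = 22 + 10 = 32    P9 = 6 + 6 = 12

Processor array (products computed this tick; finished results shown as numbers):

2*3  5*2  28
3*5  39   25
23   36   28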

Page 17: Systolic Architecture

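Clock tick: 7

Accumulators:

P1 = 23    P2 = 36    P3 = 28
P4 = 25    P5 = 39    P6 = 34
P7 = 28    P8 = 32    P9 = 12 + 25 = 37

Processor array (products computed this tick; finished results shown as numbers):

5*5  32  28
34   39  25
23   36  28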

Page 18: Systolic Architecture

Accumulators:

P1 = 23    P2 = 36    P3 = 28
P4 = 25    P5 = 39    P6 = 34
P7 = 28    P8 = 32    P9 = 37

Processor array (final results):

37  32  28
34  39  25
23  36  28

End
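The trace finishes at clock tick 7, and a short calculation shows why. This is a sketch under assumptions read off the trace rather than stated on the slides: row i of A and column j of B each enter i - 1 and j - 1 ticks late, and operands advance by one PE per tick. Under those assumptions the processor computing c_ij performs its k-th multiplication at tick t(i, j, k), and the last multiplication happens at i = j = k = n:

```latex
t(i, j, k) = i + j + k - 2, \qquad 1 \le i, j, k \le n

T(n) = \max_{i,j,k} \, t(i, j, k) = 3n - 2, \qquad T(3) = 7
```

For example, the processor holding c_33 (P9) does its first multiplication at tick 3 + 3 + 1 - 2 = 5, which matches the trace. The total of 3n - 2 ticks grows linearly in n, consistent with the O(n) claim for n x n processing units.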

Page 19: Systolic Architecture

Samba: Systolic Accelerator for Molecular Biological Applications

This systolic array contains 128 processors distributed across 32 full-custom VLSI chips. Each chip houses 4 processors, and each processor computes 10 million matrix cells per second.
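The figures above are mutually consistent, and they imply an aggregate rate. The second line is only a peak estimate, assuming all 128 processors are busy at the same time, which is not stated in the source:

```latex
32 \ \text{chips} \times 4 \ \tfrac{\text{processors}}{\text{chip}} = 128 \ \text{processors}

128 \times 10^{7} \ \tfrac{\text{cells}}{\text{s}} = 1.28 \times 10^{9} \ \tfrac{\text{cells}}{\text{s}}
```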