Lec2 final

35
Systolic Arrays & Systolic Arrays & Their Applications Their Applications

description

 

Transcript of Lec2 final

Page 1: Lec2 final

Systolic Arrays & Their Systolic Arrays & Their ApplicationsApplications

Page 2: Lec2 final

OverviewOverview

What it isN-body problem Matrix multiplicationCannon’s methodOther applicationsConclusion

Page 3: Lec2 final

Introduction – “Systolic”Introduction – “Systolic”

Page 4: Lec2 final

What Is a Systolic Array?What Is a Systolic Array?A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions

Each processor at each step takes in data from one or more neighbors (e.g. North and West), processes it and, in the next step, outputs results in the opposite direction (South and East).

H. T. Kung and Charles Leiserson were the first to publisha paper on systolic arrays in 1978, and coined the name.

Page 5: Lec2 final

What Is a Systolic Array?What Is a Systolic Array?

A specialized form of parallel computing.Multiple processors connected by short

wires. Unlike many forms of parallelism which

lose speed through their connection.Cells(processors), compute data and store it

independently of each other.

Page 6: Lec2 final

Systolic Unit(cell)Systolic Unit(cell)

Each unit is an independent processor.Every processor has some registers and an

ALU.The cells share information with their

neighbors, after performing the needed operations on the data.

Page 7: Lec2 final

Some simple examples of systolic array models.

Page 8: Lec2 final

N-body ProblemN-body Problem

Conventional method: N2

Systolic method: NWe will need one processing element for

each body, lets call them planets.

Page 9: Lec2 final

B1

B2B3

B6

B4

B5

Six planets in orbit with different masses, and different forces acting on each other, so are systolic array will look like this.

Page 10: Lec2 final

Each processor will hold the newtonian force formula:Fij = kmimj/d2

ij , k being the gravitational constant.

Then load the array with values.

B1 B2 B3 B4 B5 B6

Now that the array has the mass and the coordinates ofEach body we can begin are computation.

Page 11: Lec2 final

B1m B1c

B2m B2c

B5m B5c

B4m B4c

B6m B6c

B3m B3c

B6, B5, B4, B3, B2, B1

Fij = kmimj/d2ij

Page 12: Lec2 final

B1m B1c

B2m B2c

B5m B5c

B4m B4c

B6*B1B3m B3c

B6, B5, B4, B3, B2

Fij = kmimj/d2ij

Page 13: Lec2 final

B1m B1c

B2m B2c

B5*B1B4m B4c

B6*B2B3m B3c

B6, B5, B4, B3

Fij = kmimj/d2ij

Page 14: Lec2 final

B1m B1c

B2m B2c

B5*B2B4*B1 B6*B3B3m B3c

B6, B5, B4

Fij = kmimj/d2ij

Page 15: Lec2 final

B1m B1c

B2m B2c

B5*B3B4*B2 B6*B4B3*B1

B6, B5

Fij = kmimj/d2ij

Page 16: Lec2 final

B1m B1c

B2*B1 B5*B4B4*B3 B6*B5B3*B2

B6

Fij = kmimj/d2ij

Page 17: Lec2 final

Done Done DoneDone DoneDone

Fij = kmimj/d2ij

Page 18: Lec2 final

Matrix MultiplicationMatrix Multiplication

a11 a12 a13

a21 a22 a23

a31 a32 a33*

b11 b12 b13

b21 b22 b23

b31 b32 b33

=c11 c12 c13

c21 c22 c23

c31 c32 c33

Conventional Method: N3

For I = 1 to N For J = 1 to N For K = 1 to N C[I,J] = C[I,J] + A[J,K] * B[K,J];

Page 19: Lec2 final

Systolic MethodSystolic MethodThis will run in O(n) time!

To run in N time we need N x N processing units, in this casewe need 9.

P9P8P7

P6P5P4

P1 P2 P3

Page 20: Lec2 final

We need to modify the input data, like so:

a13 a12 a11

a23 a22 a21

a33 a32 a31

b31 b32 b33

b21 b22 b23

b11 b12 b13

Flip columns 1 & 3

Flip rows 1 & 3

and finally stagger the data sets for input.

Page 21: Lec2 final

P9P8P7

P6P5P4

P1 P2 P3a13 a12 a11

a23 a22 a21

a33 a32 a31

b31

b21

b11

b32

b22

b12

b33

b23

b13

At every tick of the global system clock data is passed to eachprocessor from two different directions, then it is multiplied and the result is saved in a register.

Page 22: Lec2 final

3 4 2

2 5 33 2 5

* =

3 4 2

2 5 33 2 5

23 36 28

25 39 3428 32 37

Lets try this using a systolic array.

P9P8P7

P6P5P4

P1 P2 P32 4 3

3 5 2

323

5 2 3

532

254

Page 23: Lec2 final

3*32 4

3 5 2

32

5 2 3

532

254

Clock tick: 1

9 0 0 0 0 0 0 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 24: Lec2 final

2*3

4*2 3*42

3 5

3

5 2 3

532

25

Clock tick: 2

17 12 0 6 0 0 0 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 25: Lec2 final

3*3

2*45*2

2*3 4*5 3*2

3

5 2

532

Clock tick: 3

23 32 6 16 8 0 9 0 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 26: Lec2 final

3*42*2

2*25*53*3

2*2 4*3

5

5

Clock tick: 4

23 36 18 25 33 4 13 12 0

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 27: Lec2 final

3*25*25*3

5*33*2

2*5

Clock tick: 5

23 36 28 25 39 19 28 22 6

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 28: Lec2 final

2*35*2

3*5

Clock tick: 6

23 36 28 25 39 34 28 32 12

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 29: Lec2 final

5*5

Clock tick: 7

23 36 28 25 39 34 28 32 37

P1 P2 P3 P4 P6P5 P7 P8 P9

Page 30: Lec2 final

373228

343925

23 36 28

23 36 28 25 39 34 28 32 37

P1 P2 P3 P4 P6P5 P7 P8 P9

Same answer! In 2n + 1 time, can we do better?

The answer is yes, there is an optimization.

Page 31: Lec2 final

Cannon’s TechniqueCannon’s Technique

No processor is idle.Instead of feeding the data, it is cycled.Data is staggered, but slightly different than

before.Not including the loading which can be

done in parallel for a time of 1, this technique is effectively N.

Page 32: Lec2 final

Other ApplicationsOther Applications

Signal processingImage processingSolutions of differential equationsGraph algorithmsBiological sequence comparisonOther computationally intensive tasks

Page 33: Lec2 final

Samba: Systolic Accelerator Samba: Systolic Accelerator for Molecular Biological for Molecular Biological

Applications Applications This systolic array contains 128 processors shared into 32 full custom VLSI chips. One chip houses 4 processors, and one processor performs 10 millions matrix cells per second.

Page 34: Lec2 final

Why Systolic? Why Systolic?

Extremely fast.Easily scalable architecture.Can do many tasks single processor

machines cannot attain.Turns some exponential problems into

linear or polynomial time.

Page 35: Lec2 final

Why Not Systolic?Why Not Systolic?

Expensive.Not needed on most applications, they are a

highly specialized processor type.Difficult to implement and build.