A Concurrent Matrix Transpose Algorithm, The Implementation. Presented by Pourya Jafari.


Page 1

A Concurrent Matrix Transpose Algorithm, The Implementation

Presented by Pourya Jafari

Page 2

Review: Algorithm Steps

- Pre-process inside each thread: shift rows
- Intra-process/thread communication: shift columns
- Post-process inside each thread: shift rows again

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

Page 3

Review: Shift values?

- Set shifts based on row index: range 0 to N-1
- Now arrange the rows, so that the column shifts get us to i
  - Preprocess shifting: i’ = i - L
  - After the intra-process shift, the column index should be equal to the original row index i
    - i’ + j = i
    - i - L + j = i
    - L = j
- So we shift each column j cells up
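
The pre-shift above can be sketched in a few lines of Python. This is only an illustration, assuming the matrix is stored as a list of rows and each cell is labeled with its original row and column digits as on the slides; `preshift_columns` is a hypothetical helper name, not code from the presentation.

```python
def preshift_columns(matrix):
    """Rotate column j upward by j cells: the cell at (i, j) moves to ((i - j) mod N, j)."""
    n = len(matrix)
    shifted = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shifted[(i - j) % n][j] = matrix[i][j]
    return shifted


# Label every cell with its original "row, column" digits, as on the slides.
original = [[f"{i}{j}" for j in range(4)] for i in range(4)]

for row in preshift_columns(original):
    print(" ".join(row))
```

Each column only touches its own cells here, which is why this step can run inside a single column process without any communication.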

Page 4

Review: Last step?

- 1 → 2: Column shift j up
- 2 → 3: Row shift based on row indices
- 3 → 4: ?
  - Change of indices so far: (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n)
  - One operation to change the row index to j: n - m = i - (i - j) = j

(1) Original matrix:

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33

(2) After the column shift (column j shifted j cells up):

00 11 22 33
10 21 32 03
20 31 02 13
30 01 12 23

(3) After the row shift (row m shifted m cells to the right):

00 11 22 33
03 10 21 32
02 13 20 31
01 12 23 30

(4) Transposed result:

00 10 20 30
01 11 21 31
02 12 22 32
03 13 23 33
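
The index bookkeeping on this slide can be checked with a small sequential Python sketch. It assumes 0-based indices, a square matrix stored as a list of rows, and a rightward row shift (the direction that reproduces the example matrices); `transpose_by_shifts` is a hypothetical name, and the sketch ignores concurrency entirely.

```python
def transpose_by_shifts(matrix):
    """Return the three intermediate states of the shift-based transpose."""
    n = len(matrix)
    # 1 -> 2: column j rotated up by j, so (i, j) lands on (i - j, j).
    step2 = [[matrix[(m + j) % n][j] for j in range(n)] for m in range(n)]
    # 2 -> 3: row m rotated right by m, so (m, j) lands on (m, j + m).
    step3 = [[step2[m][(c - m) % n] for c in range(n)] for m in range(n)]
    # 3 -> 4: inside column c, the cell at row m moves to row (c - m) mod n.
    step4 = [[None] * n for _ in range(n)]
    for m in range(n):
        for c in range(n):
            step4[(c - m) % n][c] = step3[m][c]
    return step2, step3, step4


matrix = [[f"{i}{j}" for j in range(4)] for i in range(4)]
step2, step3, step4 = transpose_by_shifts(matrix)
for row in step4:
    print(" ".join(row))
```

Note that the last step is a permutation inside each column rather than a uniform rotation: the new row index is n - m, computed modulo N.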

Page 5

Review: Radix

- Using radix representation, we can group row shifts
- We use radix 2 for simplicity
  - Digits are the bit representation
  - Shift all rows whose indices have their k-th bit on

[Figure: the shift for each row (rows 0 to 3) shown as the sum of the k=0 and k=1 shifts]
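
As a rough illustration of the radix-2 grouping, the following Python sketch applies one phase per bit position and compares the result with shifting row i by i directly. The function names and the rightward shift direction are assumptions made for the example.

```python
def rotate_right(row, shift):
    """Rotate a list to the right by `shift` positions."""
    shift %= len(row)
    cut = len(row) - shift
    return row[cut:] + row[:cut]


def shift_rows_radix2(matrix):
    """Shift row i right by i using one phase per bit of the row index."""
    n = len(matrix)
    rows = [list(row) for row in matrix]
    k = 0
    while (1 << k) < n:
        step = 1 << k
        for i in range(n):
            if i & step:                      # k-th bit of the row index is on
                rows[i] = rotate_right(rows[i], step)
        k += 1
    return rows


matrix = [[f"{i}{j}" for j in range(4)] for i in range(4)]
grouped = shift_rows_radix2(matrix)
direct = [rotate_right(row, i) for i, row in enumerate(matrix)]
print(grouped == direct)
```

With this grouping the number of shift phases grows with the number of bits in N - 1 instead of with N itself.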

Page 6

The concurrency picture

- Each thread can do pre/post processing independently
- Processes must synchronize
  - after each phase
  - after each step of the intra-process shift
  - during intra-process communications

Page 7

Communication package (1)

- We need a means of communication that
  - facilitates synchronized communication
  - provides unbuffered communication to save memory
- JCSP
  - is based on the algebra of Communicating Sequential Processes (CSP)
  - has a strong theoretical background
  - is object oriented

Page 8

Communication package (2)

JCSP provides

- One2OneChannel: a single sender can send and a single receiver can receive
- One2AnyChannel: a single sender and many receivers can communicate, but only one receiver at a time
- Any2OneChannel: multiple senders and one receiver
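
To make "synchronized, unbuffered" concrete, here is a small Python stand-in for a one-to-one channel in which a write does not return until the reader has taken the item. `RendezvousChannel` is a hypothetical class written for this illustration with the standard `queue` and `threading` modules; it is not the JCSP API.

```python
import queue
import threading


class RendezvousChannel:
    """Hypothetical one-sender, one-receiver channel with synchronized hand-over."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def write(self, item):
        self._slot.put(item)    # offer the item
        self._ack.get()         # block until the reader has taken it

    def read(self):
        item = self._slot.get()
        self._ack.put(None)     # release the blocked writer
        return item


channel = RendezvousChannel()
write_done = threading.Event()


def sender():
    channel.write("cell 01")
    write_done.set()


thread = threading.Thread(target=sender)
thread.start()
# No one has read yet, so the writer must still be blocked after the timeout.
blocked_before_read = not write_done.wait(timeout=0.2)
received = channel.read()
thread.join()
print(blocked_before_read, received, write_done.is_set())
```

The blocking write is what makes the ordering of sends and receives matter later on: two processes that both write first to each other would wait forever.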

Page 9

Classes (1)

CProcess: Column process

- Has a PID; knows N; has an array to save its items
- One2OneChannel to each other process for the intra-process shift operation
- One2AnyChannel to MProcess to receive start/resume calls
- Any2OneChannel to MProcess to signal that this CProcess has finished the current step

Page 10

Classes (2)

MProcess: Master process

- One2AnyChannel and Any2OneChannel to any CProcess
- Synchronizes the phases and the intra-process communication by waiting for all CProcesses to finish the current phase and then resuming them for the next phase
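
The master/worker handshake described here can be sketched with Python threads and queues. The sketch is a simplification under stated assumptions: it uses one resume queue per worker (so a fast worker cannot consume another worker's resume token) instead of a single shared One2Any channel, and the names `run_phases`, `cprocess` and `mprocess` are hypothetical.

```python
import queue
import threading


def run_phases(n_workers, n_phases):
    """No worker starts a phase before every worker has finished the previous one."""
    done = queue.Queue()                                  # workers -> master
    resume = [queue.Queue() for _ in range(n_workers)]    # master -> each worker (assumed)
    log, log_lock = [], threading.Lock()

    def cprocess(pid):
        for phase in range(n_phases):
            resume[pid].get()                 # wait for the start/resume call
            with log_lock:
                log.append((phase, pid))      # stand-in for the real work of the phase
            done.put(pid)                     # report that this phase is finished

    def mprocess():
        for _ in range(n_phases):
            for channel in resume:
                channel.put("resume")         # release every worker into the phase
            for _ in range(n_workers):
                done.get()                    # wait until all workers have reported

    threads = [threading.Thread(target=cprocess, args=(pid,)) for pid in range(n_workers)]
    threads.append(threading.Thread(target=mprocess))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


log = run_phases(n_workers=4, n_phases=3)
print([phase for phase, _ in log])
```

Because the master collects every "done" signal before sending the next round of "resume" signals, the logged phase numbers never go backwards.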

Page 11

Classes (3)

Launcher: Threads driver

- Creates the channels
- Creates one MProcess and the CProcesses
- Runs them in parallel

Page 12

Intra-process communication in CProcess

- Might send/receive multiple items
- Determines the indices that need to be shifted
- Packs them in the form of a message
- Sends the message to the next CProcess and receives from the previous process in the shift chain
- Unpacks the received message
- Assigns the items inside to the same indices determined in the first step
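
The steps above can be simulated sequentially in Python. In this sketch each column process is just a list indexed by row, phase k moves the rows whose k-th bit is on, and the message goes to process (pid + 2^k) mod N; that direction and the helper name `shift_phase` are assumptions chosen to match the example matrices, and real channels are replaced by a list of packed messages.

```python
def shift_phase(columns, k):
    """One shift phase: every column process exchanges the rows whose k-th bit is on."""
    n = len(columns)
    step = 1 << k
    indices = [m for m in range(n) if m & step]                     # indices to shift
    outbox = [[column[m] for m in indices] for column in columns]   # pack the messages
    for pid in range(n):
        message = outbox[(pid - step) % n]       # receive from the previous process
        for m, item in zip(indices, message):    # unpack into the same indices
            columns[pid][m] = item
    return columns


n = 4
matrix = [[f"{i}{j}" for j in range(n)] for i in range(n)]
columns = [[matrix[m][c] for m in range(n)] for c in range(n)]      # one list per CProcess
for k in range(2):                                                  # bits 0 and 1 cover shifts 0..3
    shift_phase(columns, k)
rows = [[columns[c][m] for c in range(n)] for m in range(n)]
for row in rows:
    print(" ".join(row))
```

All messages are packed before any cell is overwritten, which mimics every process sending its old values; in the threaded version that role is played by the buffering discussed on the shift-cycle slides.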

Page 13

UML Diagram

CProcess
  -PID : int
  -N : int
  -ch0 : One2OneChannel
  -ch1 : Any2OneChannel
  -ch2 : One2AnyChannel

CSProcess
  +run()

One2OneChannel
  +In()
  +Out()

One2AnyChannel
  +In()
  +Out()

Any2OneChannel
  +In()
  +Out()

MProcess
  -N : int
  -ch1 : Any2OneChannel
  -ch2 : One2AnyChannel

Launcher
  -N : int
  -ch0 : One2OneChannel
  -ch1 : Any2OneChannel
  -ch2 : One2AnyChannel
  -CPs : CSProcess

Page 14

The Intraprocess Shift

- Synchronized send and then receive
- A cycle might form
- All CProcesses will go to the send state and wait for the next CProcess to receive
- None of the CProcesses receive -> Deadlock

[Figure: nodes labeled 0 to 8]

Page 15

The Shift Cycle (1)

- One CProcess in the cycle should receive first to break the cycle
- But it would lose the value which it has to send
- So it receives and buffers the value sent to it
- It then sends, and assigns the buffered value to the relevant array cell
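
A threaded Python sketch of this fix, for a single ring in which every process passes one value to its successor. It assumes at least two processes, lets PID 0 be the one that receives first, and reuses a hypothetical `RendezvousChannel` built from standard-library queues rather than JCSP channels.

```python
import queue
import threading


class RendezvousChannel:
    """Hypothetical unbuffered channel: write blocks until the item has been read."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def write(self, item):
        self._slot.put(item)
        self._ack.get()

    def read(self):
        item = self._slot.get()
        self._ack.put(None)
        return item


def ring_shift(values):
    """Shift every value to the next process in a ring of len(values) >= 2 processes."""
    n = len(values)
    channels = [RendezvousChannel() for _ in range(n)]   # channels[p]: p -> (p + 1) % n
    cells = list(values)

    def cprocess(pid):
        out_channel, in_channel = channels[pid], channels[(pid - 1) % n]
        if pid == 0:
            buffered = in_channel.read()      # receive first and buffer the value
            out_channel.write(cells[pid])     # own value is still intact, so send it
            cells[pid] = buffered             # only now overwrite the array cell
        else:
            out_channel.write(cells[pid])     # everyone else sends first
            cells[pid] = in_channel.read()

    threads = [threading.Thread(target=cprocess, args=(pid,), daemon=True) for pid in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return cells


print(ring_shift(["a", "b", "c", "d"]))
```

If PID 0 also sent first, every process would block in `write` waiting for a reader that never arrives, which is the deadlock described on the previous slide.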

Page 16

The Shift Cycle (2)

- Cycles happen when the interleaving value h divides N
- We do a buffered read for all numbers less than h

[Figure: nodes labeled 0 to 8]

Page 17

The Shift Cycle (3)

- Even after this, the program runs into deadlock again
- Cycles form when gcd(h, N) is greater than 1
- Must buffer values less than or equal to gcd(h, N)

[Figure: nodes labeled 0 to 8]
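
The role of gcd(h, N) can be checked by enumerating the cycles of the map pid → (pid + h) mod N. The Python sketch below assumes 0-based PIDs and uses a hypothetical helper `shift_cycles`; the values N = 9 with h = 3, 6 and 2 are chosen only as examples.

```python
from math import gcd


def shift_cycles(n, h):
    """Group the PIDs 0..n-1 into the cycles of the map pid -> (pid + h) % n."""
    seen, cycles = set(), []
    for start in range(n):
        if start in seen:
            continue
        cycle, pid = [], start
        while pid not in seen:
            seen.add(pid)
            cycle.append(pid)
            pid = (pid + h) % n
        cycles.append(cycle)
    return cycles


for n, h in [(9, 3), (9, 6), (9, 2)]:
    cycles = shift_cycles(n, h)
    print(f"N={n} h={h} gcd={gcd(h, n)} cycles={cycles}")
```

For h = 6 the shift does not divide 9, yet the processes still split into three separate cycles, which is consistent with the second deadlock reported above. With 0-based PIDs the smallest member of each cycle is a PID below gcd(h, N), so one receive-first process per cycle can be picked that way; the slides may use a different indexing for this bound.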

Page 18

Results