Decomposition

• Data Decomposition
– Dividing the data into subgroups and assigning each piece to different processors
– Example: Embarrassingly parallel applications
• Functional Decomposition
– Dividing an algorithm into its functional pieces and executing the pieces in separate processors
– Example: Pipelining



Pipelined Computations

• Divide a problem into a series of tasks
• A processor completes a task sequentially and pipes the results to the next processor

Example of Summing Groups of Numbers

[Figure: processors P0 through P5 connected in a line. The running sum enters P0 as zero, each processor Pk adds its group sum ∑A[ik], and the total leaves P5.]

Question: Is this data or is it functional decomposition?


Where is Pipelining Applicable?

Type 1
– More than one instance of a problem
– Example: Multiple simulations with different parameter settings

Type 2
– Series of data items with multiple operations
– Example: Signal filter or Sieve of Eratosthenes

Type 3
– Partial results passed on while processing continues
– Example: Solving sets of linear equations

Considerations
– Is there a series of sequential tasks?
– Is the processing of each task approximately equal?
– Can items be grouped to minimize communication cost?
– If stages exceed processors
  o Group stages
  o Wrap the last stage back to the first
– Determine where the result will be at the end of the process
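When stages exceed processors, the two options in the considerations above correspond to two stage-to-processor mappings. The following is only a sketch of that idea; the function names `group_stages` and `wrap_stages` are hypothetical and do not come from the slides.

```python
def group_stages(num_stages, num_procs):
    """Assign contiguous blocks of stages to each processor (grouping)."""
    block = -(-num_stages // num_procs)  # ceiling division: stages per processor
    return [stage // block for stage in range(num_stages)]


def wrap_stages(num_stages, num_procs):
    """Assign stages round-robin, so the last processor wraps back to the first."""
    return [stage % num_procs for stage in range(num_stages)]


# Entry k of each list is the processor that runs stage k.
print(group_stages(7, 3))  # [0, 0, 0, 1, 1, 1, 2]
print(wrap_stages(7, 3))   # [0, 1, 2, 0, 1, 2, 0]
```

Grouping keeps consecutive stages on one processor, which tends to reduce messages; wrapping needs a link from the last processor back to the first.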


Summing Numbers Example

Process P0
    send(&number, P1);

Process Pi, where 0 < i < N-1
    recv(&sum, Pi-1);
    sum += number;
    send(&sum, Pi+1);

Process PN-1
    recv(&sum, PN-2);
    sum += number;
    Save or display result
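The three process roles above can be simulated on one machine. This is a minimal sketch, assuming one Python thread per pipeline process and a `queue.Queue` in place of each send/recv pair; the name `run_sum_pipeline` is hypothetical.

```python
import queue
import threading


def run_sum_pipeline(numbers):
    """Stage i holds numbers[i]; a running sum flows from stage 0 to the last stage."""
    n = len(numbers)
    # links[i] connects stage i to stage i + 1 (stands in for send/recv)
    links = [queue.Queue() for _ in range(n - 1)]
    result = []

    def stage(i):
        if i == 0:
            total = numbers[0]                      # P0 starts the sum
        else:
            total = links[i - 1].get() + numbers[i]  # recv, then add own number
        if i < n - 1:
            links[i].put(total)                     # send to the next stage
        else:
            result.append(total)                    # last stage saves the result

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(run_sum_pipeline([3, 1, 4, 1, 5, 9]))  # 23
```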


Application
• Remove frequencies from a signal
– Sequential algorithm: Fourier analysis, O(N lg N)
– Parallel: Apply filters to the signal with convolution, O(N*FilterLength)
– Filter examples: Chebyshev, Butterworth, etc.
– Derive a filter: Set the z-domain poles and zeroes, then perform the inverse transformation
– Filters can be useful to manipulate signals, detect patterns, etc.
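To make the O(N*FilterLength) cost of convolution concrete, here is a small sketch of direct convolution. It assumes a finite list of filter coefficients; the two-tap moving average is a stand-in for illustration, not a Chebyshev or Butterworth design, and `fir_filter` is a hypothetical name.

```python
def fir_filter(signal, coefficients):
    """Direct convolution: each output sample costs one pass over the coefficients."""
    output = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coefficients):
            if n - k >= 0:  # samples before the start of the signal count as zero
                acc += c * signal[n - k]
        output.append(acc)
    return output


# Two-tap moving average: smooths the signal, attenuating the highest frequencies.
print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

The nested loops run N times FilterLength multiply-adds, which matches the bound stated above.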


Chebyshev Filter Design

[Figures: Chebyshev in the z-domain; Chebyshev frequency response]

Note: Depending on the placement of the poles (+) and zeroes (0), the filter will affect a signal differently


Type 1: Multiple Instances

Sequential execution: t1 = m*tm

Parallel processing: (m + p - 1)*tm/p

Parallel communication: (m + p - 1)*(tstart + n*tdata)

Parallel time: tp = (m + p - 1)*(tm/p + tstart + n*tdata)

Speed-up: S = t1/tp = m*tm/((m + p - 1)*(tm/p + tstart + n*tdata))

[Figure: Space-Time Diagram showing Instances 1 through 5 moving through processors P0 to P5]

Notation
1. m = instances, p = processors
2. tstart = message startup time (latency), tdata = transmission time per data item (determined by bandwidth)
3. n = data transmitted per instance
4. tm = total time to process an instance
5. Total pipeline cycles = m + p - 1
6. Assume: Equal processing per stage
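The Type 1 timing model can be evaluated directly. The sketch below simply transcribes the formulas of this slide using its notation; the function name `type1_speedup` and the sample values are assumptions chosen for illustration.

```python
def type1_speedup(m, p, tm, tstart, tdata, n):
    """Speed-up of a p-stage pipeline over sequential execution of m instances."""
    t1 = m * tm                                  # sequential time
    cycles = m + p - 1                           # total pipeline cycles
    tp = cycles * (tm / p + tstart + n * tdata)  # compute plus communication per cycle
    return t1 / tp


# With no communication cost the speed-up approaches p as m grows.
print(type1_speedup(m=100, p=5, tm=10.0, tstart=0.0, tdata=0.0, n=0))
print(type1_speedup(m=100, p=5, tm=10.0, tstart=0.5, tdata=0.01, n=10))
```

The second call shows how per-cycle communication lowers the speed-up for the same workload.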


Type 2: Multiple Data Elements

[Figure: data items d9 ... d0 stream through processors P0 to P5; processor Pi applies Filter fi]

Example: Signal Filter
Each process removes one or more frequencies from a digitized signal
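A Type 2 pipeline can be traced cycle by cycle. This sketch assumes lock-step execution in which every stage handles one item per cycle; the stage functions are arbitrary placeholders rather than real signal filters, and `run_lockstep_pipeline` is a hypothetical name.

```python
EMPTY = object()  # marks a stage that has no item in the current cycle


def run_lockstep_pipeline(items, stages):
    """Push items through the stages one cycle at a time; return (outputs, cycles)."""
    slots = [EMPTY] * len(stages)  # slots[k] is the item held by stage k
    pending = list(items)
    outputs = []
    cycles = 0
    while pending or any(s is not EMPTY for s in slots):
        head = pending.pop(0) if pending else EMPTY
        slots = [head] + slots[:-1]  # every item moves one stage to the right
        slots = [f(x) if x is not EMPTY else EMPTY for f, x in zip(stages, slots)]
        if slots[-1] is not EMPTY:   # the last stage emits a finished item
            outputs.append(slots[-1])
            slots[-1] = EMPTY
        cycles += 1
    return outputs, cycles


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_lockstep_pipeline([1, 2, 3, 4], stages))  # ([1, 3, 5, 7], 6)
```

Four items through three stages take 4 + 3 - 1 = 6 cycles, the same m + p - 1 count used in the Type 1 analysis.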


Type 2 Timing Diagram


Type 3: Partial Processing
• The next stage receives information to continue processing
• Additional processing continues at the source processor

Question: How do we determine speed-up?













[Figures: processor activity over time for "Linear Equations" and for "A More Balanced Load"; the legend distinguishes idle from executing time]


Operation at each processor: Types 1 and 2

• Processor with rank r = 0
– Generate the instance (type 1) or the data (type 2) to process
– Process appropriately
– Send message to the processor with rank 1
• Processors with rank r = 1, 2, ..., p-2
– Receive message from the processor with rank r-1
– Process appropriately
– Send message to the processor with rank r+1
• Processor with rank r = p-1
– Receive message from the processor with rank r-1
– Process appropriately
– Output final results

Examples
1) Adding numbers: n1 -> n1+n2 -> n1+n2+n3 -> . . .
2) Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> . . .


Parallel Pipeline


[Figure: step-by-step trace with columns Step, Numbers, P0, P1, P2, P3, P4. The input 4, 3, 1, 2, 5 enters the pipeline one number per step, and the processors end up holding 5, 4, 3, 2, 1.]

• Pseudo code
    Receive xi
    IF xi < xmax
        Send xi
    ELSE
        Send xmax
        xmax = xi

Note: Processors can hold blocks of numbers for better efficiency
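The compare-and-forward rule can be simulated sequentially to check the final arrangement. This sketch assumes one processor per number and that an empty processor simply keeps the first value it receives; `pipeline_sort` is a hypothetical name.

```python
def pipeline_sort(numbers):
    """Simulate the sorting pipeline; held[k] is the value kept by processor k."""
    held = []
    for x in numbers:
        # The number travels down the pipeline; each processor keeps the larger
        # value and forwards the smaller one.
        for k in range(len(held)):
            if x > held[k]:
                held[k], x = x, held[k]
        held.append(x)  # the first empty processor keeps what is left
    return held


print(pipeline_sort([4, 3, 1, 2, 5]))  # [5, 4, 3, 2, 1]
```

In a real pipeline the inner loop runs concurrently across processors, so later numbers enter while earlier ones are still propagating.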


Bi-Directional Pipeline
• Use the pipeline to return results to the master
– Useful for line, ring, or hypercube topologies

Example: Sorting

[Figure: processors P0 to P5 in a line, with a timeline at P0 showing the sorting phase followed by the gather phase]

Phases
• N generate steps
• N-1 propagate steps
• N-1 return steps
• Total: N + (N-1) + (N-1) = 3N-2 steps

• Sort phase
If (myid == 0) generate number
Else receive(&number, pmyid-1)
If (number > max) {
    If (myid < P-1) send(&max, pmyid+1);
    max = number;
}
Else If (myid < P-1) send(&number, pmyid+1);

• Gather phase
If (myid < P-1) receive sorted numbers from pmyid+1
If (myid > 0) send sorted numbers to pmyid-1


Sieve of Eratosthenes


Prime Number Generation: Sieve of Eratosthenes (Type 2 pipeline)

• Concept
– Each processor filters blocks of non-primes from the flow of data
– The "potential" prime numbers pass through to the next processor
• Pseudo-code
The master processor generates an array of n odd numbers
In a loop, after receiving a group of numbers
    Filter the group of numbers; pass unfiltered numbers down the pipeline
Gather all of the primes
• Notes
– Wrapping the pipeline in a ring could help maintain load balance
– A termination message determines when the pipeline empties

Question: What range of numbers should each processor get?


Sequential code
for (i = 2; i < n; i++)
    prime[i] = 1;
for (i = 2; i <= sqrt_n; i++)
    if (prime[i] == 1)
        for (j = i + i; j < n; j = j + i)
            prime[j] = 0;

Parallel code: Processor pi > 0
recv(number, rank-1);
PRIME = TRUE;
FOR (int x = MIN; x < MAX; x++)    // each divisor in this processor's range
    IF ((number % x) == 0) { PRIME = FALSE; BREAK; }
IF (PRIME) send(number, rank+1);

Termination
recv(number, rank-1);
IF (number == terminator) {
    send(number, rank+1);          // forward the terminator, then stop
    break;
}

Sequential time: O(n^2) as a loose upper bound; the sieve loop above performs O(n log log n) marking operations
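The filtering stages and the termination message can be tried out with threads. This is a sketch under several assumptions: Python threads and queues stand in for processors and messages, the master feeds every integer from 2 rather than only odd numbers, and divisors are dealt to stages round-robin. The names `sieve_stage` and `pipeline_primes` are hypothetical.

```python
import math
import queue
import threading

TERMINATOR = None  # sentinel message that flushes the pipeline


def sieve_stage(divisors, inbox, outbox):
    """Drop numbers divisible by one of this stage's divisors; forward the rest."""
    while True:
        number = inbox.get()
        if number is TERMINATOR:
            outbox.put(TERMINATOR)  # pass the terminator on, then stop
            break
        if all(number == d or number % d != 0 for d in divisors):
            outbox.put(number)


def pipeline_primes(n, num_stages=3):
    """Return the primes up to n using a chain of filtering stages."""
    divisors = list(range(2, math.isqrt(n) + 1))
    blocks = [divisors[k::num_stages] for k in range(num_stages)]
    links = [queue.Queue() for _ in range(num_stages + 1)]
    workers = [
        threading.Thread(target=sieve_stage, args=(blocks[k], links[k], links[k + 1]))
        for k in range(num_stages)
    ]
    for w in workers:
        w.start()
    for number in range(2, n + 1):  # the master feeds the candidates
        links[0].put(number)
    links[0].put(TERMINATOR)
    primes = []
    while True:  # gather whatever survives every stage
        number = links[-1].get()
        if number is TERMINATOR:
            break
        primes.append(number)
    for w in workers:
        w.join()
    return primes


print(pipeline_primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Early stages see every candidate while later stages see only survivors, which is the load-balance concern raised in the notes.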



Upper Triangular Matrix

All entries below the diagonal are zero
Useful for solving N equations in N unknowns


Solving Sets of Linear Equations

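• Upper Triangular Form

$$
\begin{aligned}
a_{n-1,0}x_0 + a_{n-1,1}x_1 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1} \\
a_{n-2,0}x_0 + a_{n-2,1}x_1 + \cdots + a_{n-2,n-2}x_{n-2} &= b_{n-2} \\
&\;\;\vdots \\
a_{1,0}x_0 + a_{1,1}x_1 &= b_1 \\
a_{0,0}x_0 &= b_0
\end{aligned}
$$

• Back Substitution

$$
x_0 = \frac{b_0}{a_{0,0}}, \qquad x_1 = \frac{b_1 - a_{1,0}x_0}{a_{1,1}}
$$

• General solution for $x_i$

$$
x_i = \frac{b_i - \sum_{j=0}^{i-1} a_{i,j}x_j}{a_{i,i}}
$$

• Sequential code
x[0] = b[0]/a[0][0];
for (i = 1; i < n; i++) {
    sum = 0;
    for (j = 0; j < i; j++)
        sum += a[i][j]*x[j];
    x[i] = (b[i] - sum)/a[i][i];
}

• Parallel pseudo code for pi, receiving all values before computing
for (j = 0; j < i; j++) {
    recv(x[j], pi-1);
    send(x[j], pi+1);
}
sum = 0;
for (j = 0; j < i; j++)
    sum = sum + a[i][j]*x[j];
x[i] = (b[i] - sum)/a[i][i];
send(x[i], pi+1);

• Parallel code for pi where 1 <= i < n, accumulating while receiving
sum = 0;
for (j = 0; j < i; j++) {
    receive(&x[j], pi-1);
    sum += a[i][j]*x[j];
    send(&x[j], pi+1);
}
x[i] = (b[i] - sum)/a[i][i];
send(&x[i], pi+1);

This is a type 3 pipeline example

Note: $a_{i,j}$ and $b_i$ are constants. The last processor, pn-1, has no successor and skips its sends.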
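The accumulate-while-receiving version can be checked on a small system. This sketch assumes one Python thread per equation and a queue per link in place of send/receive; unlike the pseudo code it forwards each value before using it, so the next stage can start sooner. The name `pipelined_back_substitution` is hypothetical.

```python
import queue
import threading


def pipelined_back_substitution(a, b):
    """Solve a[i][0]*x[0] + ... + a[i][i]*x[i] = b[i], one pipeline stage per row."""
    n = len(b)
    # inbox[i] carries values from stage i - 1 into stage i (inbox[0] is unused)
    inbox = [queue.Queue() for _ in range(n)]
    x = [0.0] * n

    def stage(i):
        total = 0.0
        for j in range(i):
            xj = inbox[i].get()         # receive x[j] from the previous stage
            if i < n - 1:
                inbox[i + 1].put(xj)    # forward it before using it
            total += a[i][j] * xj       # partial back substitution
        x[i] = (b[i] - total) / a[i][i]
        if i < n - 1:
            inbox[i + 1].put(x[i])      # send the newly solved unknown

    workers = [threading.Thread(target=stage, args=(i,)) for i in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return x


a = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 4.0]]
b = [4.0, 5.0, 20.0]
print(pipelined_back_substitution(a, b))  # [2.0, 3.0, 3.0]
```

Stage i performs i multiply-adds, so the work grows along the pipeline.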


Pipeline Solution


REPEAT
    IF p ≠ master, receive xj from the previous processor
    IF p ≠ P-1, send xj to the next processor
    back substitute xj
UNTIL xi evaluated
IF p ≠ P-1, send xi to the next processor

• Notes:
1. Processing continues after sending values down the pipeline
2. Is the load imbalanced?


Illustration of Type 3 Solution

[Figure: processors P0, P1, P2, and P3 over time, computing x0, x1, x2, and x3 respectively]

Question: How balanced is this load?