Parallelizing Iterative Computation for Multiprocessor Architectures

Peter Cappello

What is the problem?

Create programs for multi-processor unit (MPU)

– Multicore processors

– Graphics processing units (GPU)

For whom is it a problem? Compiler designer

Application Program → Compiler → Executable

HARDER

MUCH HARDER

For whom is it a problem? Application programmer

Complex Machine Consequences

• Programmer needs to be highly skilled

• Programming is error-prone

These consequences imply . . .

Increased parallelism → increased development cost!

Amdahl’s Law

The speedup of a program is bounded by its inherently sequential part.

(http://en.wikipedia.org/wiki/Amdahl's_law)

If

– A program needs 20 hours using a CPU

– 1 hour cannot be parallelized

Then

– Minimum execution time ≥ 1 hour.

– Maximum speedup ≤ 20.
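The bound above follows from the usual statement of Amdahl's law, speedup(p) = 1 / (s + (1 - s)/p), where s is the sequential fraction of the work and p the number of processors. A minimal Java sketch of the 20-hour example follows; the class and method names are hypothetical and not from the slides.

```java
// Illustrative sketch of Amdahl's law; names are hypothetical.
class AmdahlSketch {
    /** Speedup on p processors when a fraction s of the work is sequential. */
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    /** Running time on p processors when sequentialHours of totalHours cannot be parallelized. */
    static double hours(double totalHours, double sequentialHours, int p) {
        return sequentialHours + (totalHours - sequentialHours) / p;
    }

    public static void main(String[] args) {
        double s = 1.0 / 20.0; // 1 hour out of 20 is sequential
        for (int p : new int[] {1, 2, 16, 1024}) {
            System.out.printf("p=%d time=%.3f h speedup=%.3f%n",
                    p, hours(20, 1, p), speedup(s, p));
        }
    }
}
```

However large p becomes, the time stays above 1 hour and the speedup stays below 20.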


Parallelization opportunities

Scalable parallelism resides in 2 sequential program constructs:

• Divide-and-conquer recursion

• Iterative statements (for)

2 schools of thought

• Create a general solution

(Address everything somewhat well)

• Create a specific solution

(Address one thing very well)

Focus on iterative statements (for)

float[] x = new float[n];
float[] b = new float[n];
float[][] a = new float[n][n];

for ( int i = 0; i < n; i++ ) {
    b[i] = 0; // reset the accumulator for row i
    for ( int j = 0; j < n; j++ )
        b[i] += a[i][j]*x[j]; // add the contribution of column j
}

Matrix-Vector Product

b = Ax, illustrated with a 3×3 matrix, A.

_______________________________

b1 = a11*x1 + a12*x2 + a13*x3

b2 = a21*x1 + a22*x2 + a23*x3

b3 = a31*x1 + a32*x2 + a33*x3

[Figure, repeated over three slides: the rows of A (a31 a32 a33 / a21 a22 a23 / a11 a12 a13) arranged above x1 x2 x3]

Matrix Product

C = AB, illustrated with 2×2 matrices.

c11 = a11*b11 + a12*b21

c12 = a11*b12 + a12*b22

c21 = a21*b11 + a22*b21

c22 = a21*b12 + a22*b22
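The four equations above are instances of c[i][j] = Σ a[i][k]*b[k][j]. As a sketch, the same product can be written as a triple loop in Java; the class and method names are hypothetical.

```java
// Illustrative sequential matrix product; names are hypothetical.
class MatrixProductSketch {
    /** Returns C = AB for an n×k matrix a and a k×m matrix b. */
    static float[][] multiply(float[][] a, float[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        float[][] c = new float[n][m]; // Java zero-initializes the entries
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    public static void main(String[] args) {
        float[][] c = multiply(new float[][] {{1, 2}, {3, 4}},
                               new float[][] {{5, 6}, {7, 8}});
        System.out.println(java.util.Arrays.deepToString(c));
    }
}
```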

[Figure, repeated over three slides: the rows of A (a21 a22 / a11 a12) arranged with the entries b11 b21 and b12 of B]

Declaring an iterative computation

• Index set

• Data network

• Functions

• Space-time embedding

Declaring an Index set

I1: 1 ≤ i ≤ j ≤ n

I2: 1 ≤ i ≤ n, 1 ≤ j ≤ n
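As a sketch, the two index sets can be enumerated in Java as lists of (i, j) pairs. The class and method names are hypothetical; the slides do not give a concrete representation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative enumeration of the index sets I1 and I2; names are hypothetical.
class IndexSetSketch {
    /** I1: all (i, j) with 1 ≤ i ≤ j ≤ n (a triangle). */
    static List<int[]> i1(int n) {
        List<int[]> points = new ArrayList<>();
        for (int i = 1; i <= n; i++)
            for (int j = i; j <= n; j++)
                points.add(new int[] {i, j});
        return points;
    }

    /** I2: all (i, j) with 1 ≤ i ≤ n and 1 ≤ j ≤ n (a square). */
    static List<int[]> i2(int n) {
        List<int[]> points = new ArrayList<>();
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                points.add(new int[] {i, j});
        return points;
    }

    public static void main(String[] args) {
        System.out.println("|I1| = " + i1(3).size() + ", |I2| = " + i2(3).size());
    }
}
```

For n = 3, I1 holds n(n+1)/2 = 6 points and I2 holds n² = 9.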

Declaring a Data network

x: [ -1, 0];

b: [ 0, -1];

a: [ 0, 0];

x: [ -1, 0];

b: [ -1, -1];

a: [ 0, -1];

x: [ -1, 0];

b: [ 0, -1];

a: [ 0, 0];

Declaring an Index set + Data network

1 ≤ i ≤ j ≤ n

Declaring the Functions

R1:

float x’ (float x) { return x; }

float b’ (float b, float x, float a) { return b + a*x; }

R2:

char x’ (char x) { return x; }

boolean b’ (boolean b, char x, char a) { return b && a == x; }
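The primed names above are notation, not legal Java identifiers. A minimal compilable sketch renames them (r1X, r1B, r2X, r2B are hypothetical names) and applies each b function repeatedly along one row of index points. How the slides feed a and x into each point is not shown, so the two driver methods are assumptions.

```java
// Illustrative Java rendering of R1 and R2; all names are hypothetical.
class FunctionsSketch {
    // R1: multiply-accumulate
    static float r1X(float x) { return x; }
    static float r1B(float b, float x, float a) { return b + a * x; }

    // R2: character comparison
    static char r2X(char x) { return x; }
    static boolean r2B(boolean b, char x, char a) { return b && a == x; }

    /** Assumed use of R1: accumulate one row of a matrix-vector product. */
    static float dot(float[] a, float[] x) {
        float b = 0;
        for (int j = 0; j < a.length; j++) b = r1B(b, r1X(x[j]), a[j]);
        return b;
    }

    /** Assumed use of R2: does pattern occur in text at offset? Requires offset + pattern.length() ≤ text.length(). */
    static boolean matchesAt(String text, int offset, String pattern) {
        boolean b = true;
        for (int j = 0; j < pattern.length(); j++)
            b = r2B(b, r2X(text.charAt(offset + j)), pattern.charAt(j));
        return b;
    }

    public static void main(String[] args) {
        System.out.println(dot(new float[] {1, 2, 3}, new float[] {4, 5, 6}));
        System.out.println(matchesAt("abcab", 3, "ab"));
    }
}
```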

Declaring a Spacetime embedding

E1:

– space = -i + j

– time = i + j.

E2:

– space1 = i

– space2 = j

– time = i + j.

[Figure: axes labeled time, space1, and space2]
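A time embedding is usable only if every value is produced before it is consumed. The sketch below checks this for time = i + j against the data-network vectors listed earlier. It assumes that a vector [di, dj] means the value used at point (i, j) arrives from point (i + di, j + dj), and that [0, 0] denotes a local input. That reading is an assumption; the slides do not state the convention. All names are hypothetical.

```java
// Illustrative causality check for a linear time embedding; names are hypothetical.
class ScheduleCheckSketch {
    /** Time steps between producing a value at (i+di, j+dj) and using it at (i, j). */
    static int delay(int[] timeVec, int[] dep) {
        return -(timeVec[0] * dep[0] + timeVec[1] * dep[1]);
    }

    /** True if every nonzero dependence vector has a delay of at least one time step. */
    static boolean isCausal(int[] timeVec, int[][] deps) {
        for (int[] dep : deps) {
            boolean local = dep[0] == 0 && dep[1] == 0; // [0, 0]: no communication needed
            if (!local && delay(timeVec, dep) < 1) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] time = {1, 1};                        // time = i + j
        int[][] first = {{-1, 0}, {0, -1}, {0, 0}}; // x, b, a from the first listing
        int[][] second = {{-1, 0}, {-1, -1}, {0, -1}}; // x, b, a from the second listing
        System.out.println(isCausal(time, first) + " " + isCausal(time, second));
    }
}
```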

Declaring an iterative computation: Upper triangular matrix-vector product

UTMVP = (I1,D1,F1,E1)
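To make the four components concrete, here is a sequential Java simulation of the upper triangular matrix-vector product. It visits the index set 1 ≤ i ≤ j ≤ n one time step at a time using time = i + j. All points within a time step have distinct space = -i + j values, so on a multiprocessor they could run simultaneously. This is a sketch under those assumptions, not the author's implementation; the names are hypothetical.

```java
// Illustrative time-stepped simulation of UTMVP; names are hypothetical.
class UtmvpSketch {
    /** Returns b = Ux, reading only the upper triangle of a (0-based storage). */
    static float[] multiply(float[][] a, float[] x) {
        int n = x.length;
        float[] b = new float[n];
        // Indices are 1-based as in the index set, so time = i + j runs from 2 to 2n.
        for (int time = 2; time <= 2 * n; time++) {
            // Points with i + j == time: i ≥ 1, j = time - i ≤ n, and i ≤ j.
            for (int i = Math.max(1, time - n); 2 * i <= time; i++) {
                int j = time - i;
                // This point would run on virtual processor space = j - i.
                b[i - 1] = b[i - 1] + a[i - 1][j - 1] * x[j - 1];
            }
        }
        return b;
    }

    public static void main(String[] args) {
        float[][] a = {{1, 2, 3}, {0, 4, 5}, {0, 0, 6}};
        System.out.println(java.util.Arrays.toString(multiply(a, new float[] {1, 1, 1})));
    }
}
```

Each b[i] still accumulates its terms in order of increasing j, so the result matches the sequential loop.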

Declaring an iterative computation: Full matrix-vector product

Declaring an iterative computation: Convolution (polynomial product)

Declaring an iterative computation: String pattern matching

Declaring an iterative computation: Pipelined String pattern matching

[Figure: axes labeled time, space1, and space2]

Iterative computation specification

Declarative specification

Is a 4-dimensional design space (actually 5-dimensional: the space embedding is independent of the time embedding)

Facilitates reuse of design components.

Starting with an existing language …

• Can infer

– Index set

– Data network

– Functions

• Cannot infer

– Space embedding

– Time embedding

Spacetime embedding

• Start with it as a program annotation

• More advanced: compiler-optimized, based on a program-annotated figure of merit.

• Work out details of notation

• Implement in Java, C, Matlab, HDL, …

• Map virtual processor network to actual processor network

– Java: map processors to Threads, [links to Channels]

– GPU: map processors to GPU processing elements
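One way the Java mapping might look: each virtual processor becomes a Thread that computes one row of a full matrix-vector product, and each link becomes a BlockingQueue that carries x values from processor i to processor i + 1. This is a sketch under assumptions: the slides do not name a channel type, so java.util.concurrent.LinkedBlockingQueue stands in for "Channels", and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative mapping of processors to threads and links to queues; names are hypothetical.
class ThreadedMatVecSketch {
    /** Returns b = Ax with one thread per row; x values are pipelined from row to row. */
    static float[] multiply(float[][] a, float[] x) {
        int n = a.length;
        float[] b = new float[n];
        if (n == 0) return b;
        List<BlockingQueue<Float>> links = new ArrayList<>(); // links.get(i) feeds processor i
        for (int i = 0; i < n; i++) links.add(new LinkedBlockingQueue<>());
        Thread[] processors = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int row = i;
            processors[i] = new Thread(() -> {
                try {
                    float sum = 0;
                    for (int j = 0; j < x.length; j++) {
                        float xj = links.get(row).take();            // receive x[j]
                        sum += a[row][j] * xj;                       // b + a*x
                        if (row + 1 < n) links.get(row + 1).put(xj); // pass x[j] on unchanged
                    }
                    b[row] = sum;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            processors[i].start();
        }
        try {
            for (float xj : x) links.get(0).put(xj); // inject x at the first processor
            for (Thread t : processors) t.join();    // join makes each b[row] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for processors", e);
        }
        return b;
    }

    public static void main(String[] args) {
        float[] b = multiply(new float[][] {{1, 2}, {3, 4}}, new float[] {1, 1});
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

One thread per virtual processor keeps the sketch close to the declared network. A real mapping would likely assign many virtual processors to each thread.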

Challenge: spacetime embedding depends on underlying architecture

Work …

• The output of 1 iterative computation is the input to another.

• Develop a notation for specifying a composite iterative computation?

Thanks for listening!

Questions?
