Example: parallelize a simple problem

Post on 04-Jul-2015



Dr. Mohammad Ansari http://uqu.edu.sa/staff/ar/4300205


Parallel & Distributed Computer Systems

Dr. Mohammad Ansari

Course Details

Delivery

◦ Lectures/discussions: English

◦ Assessments: English

◦ Ask questions in class if you don’t understand

◦ Email me after class if you do not want to ask in class


Assessments (this may change)

◦ Homework (~1 per week): 10%

◦ Midterm: 20%

◦ 1 project + final exam OR 2 projects: 35%+35%

Course Details

Textbook

◦ Principles of Parallel Programming, Lin & Snyder

Other sources of information:

◦ COMP 322, Rice University

◦ CS 194, UC Berkeley

◦ Cilk lectures, MIT

Many sources of information on the internet for writing parallelized code

Teaching Materials & Assignments

Everything is on Jusur

◦ Lectures

◦ Homeworks

Submit homework through Jusur

Homework is given out on Saturday

Homework due following Saturday

You lose 10% for each day late

Homework 1

First homework is available on Jusur

◦ Install Linux on your computer

It is needed for future homework

It is needed to access the supercomputers

◦ Check settings/hardware

Submit pictures of your settings

Submit description of your processor

◦ Deadline: 27/03/1431 (submit on Jusur)

Cheating in Homework/Projects

Cheating

◦ If you cheat, you get zero

◦ If you help others cheat, you will also get zero

◦ Copy + paste from Internet, e.g. Wikipedia, or elsewhere, is also cheating (called plagiarism)

◦ You can read any source of information, but you must write answers in your own words

◦ If you have problems, please ask for help.


Previous lecture:

◦ Why study parallel computing?

◦ Topics covered on this course

This lecture:

◦ Example problem

Next week:

◦ Parallel processor architectures

Example Problem

We will parallelize a simple problem

Begin to explore some of the issues related to parallel programming, and performance of parallel programs

Example Problem: Array Sum

Add all the numbers in a large array

It has 100 million elements

int size = 100000000;

int array[] = {7,3,15,10,13,18,6,4,…};

What code should we write for a sequential program?

Example Problem: Sequential

int sum = 0;
int i = 0;
for(i = 0; i < size; i++) {
    sum += array[i]; //sum=sum+array[i];
}
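As a self-contained sketch, the sequential loop can be wrapped in a function. The name `sum_sequential` and the 64-bit accumulator are assumptions added here, not part of the slides; the wider type is one way to avoid overflowing `int` when the array is large.

```c
#include <stddef.h>

/* Hypothetical helper: adds all `size` elements of `array`, one by one. */
long long sum_sequential(const int *array, size_t size) {
    long long sum = 0;               /* 64-bit accumulator (assumption) */
    for (size_t i = 0; i < size; i++) {
        sum += array[i];             /* sum = sum + array[i] */
    }
    return sum;
}
```

Calling it on the first eight values shown on the slide, `{7,3,15,10,13,18,6,4}`, would be `sum_sequential(a, 8)`.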



How Do We Parallelize?

Objective: Thinking about parallelism

◦ Multiple processors need something to do

A program/software has to be split into parts

Each part can be executed on a different processor.

◦ How do we improve performance over a single processor?

If a problem takes 2 seconds on a single processor

And we break it into two (equal) parts: 1 second for each part

And we execute the two parts separately, but in parallel, on two processors, then both parts finish in about 1 second instead of 2
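The arithmetic in this example is usually summarized as speedup. The symbols below are assumptions introduced here for illustration: T_1 is the time on one processor and T_2 is the time on two processors.

```latex
S = \frac{T_1}{T_2} = \frac{2\,\text{s}}{1\,\text{s}} = 2
```

This is the ideal case. It assumes the two parts are exactly equal and that splitting and combining the work costs nothing.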

How Do We Parallelize?

[Figure: Part 0 and Part 1 run one after the other (Sequential), compared with Part 0 and Part 1 run at the same time (Parallel)]





How Do We Start Parallelizing?

What parts can be done separately?

◦ What parts can we do on separate processors?

◦ Meaning: What parts have no data dependence

◦ Data dependence:

The execution of an instruction (or line of code) depends on the result of a previous instruction (or line of code).

◦ Data independence:

The execution of an instruction (or line of code) does not depend on the result of a previous instruction (or line of code).

Example of Data Dependence

int x = 0;

int y = 5;

x = 3;

y = y + x; //Is this line dependent on the previous line?

Data Dependence & Parallelism

In a sequential program, data dependence does not matter: each instruction executes in sequence.

◦ Instructions execute one by one

In a parallel program, data independence allows parallel execution of instructions. Data dependence prevents parallel execution of instructions.

◦ Reduces parallel performance

◦ Reduces number of processors that can be used

Why is Data Dependence Bad For Parallel Programs?

Does not allow correct parallel execution

Code run in parallel, one line per processor: x = 3; and y = y + x;

◦ If y = y + x reads x before x = 3 has executed: x = 3; y = 5; //(5 + 0)

◦ If y = y + x reads x after x = 3 has executed: x = 3; y = 8; //(5 + 3)

◦ The sequential result is always x = 3; y = 8; so the parallel result depends on timing
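The two outcomes, y = 5 and y = 8, can be reproduced without any threads by running the two statements in each possible order. This is only a sketch: the struct and function names are hypothetical, and each function stands in for one possible timing on two processors.

```c
/* Hypothetical container for the two variables from the slide. */
typedef struct { int x; int y; } State;

/* Timing 1: x = 3 completes before y = y + x reads x. */
State write_then_read(void) {
    State s = {0, 5};      /* int x = 0; int y = 5; */
    s.x = 3;
    s.y = s.y + s.x;       /* reads the new x: 5 + 3 */
    return s;
}

/* Timing 2: y = y + x reads x before x = 3 completes. */
State read_then_write(void) {
    State s = {0, 5};
    s.y = s.y + s.x;       /* reads the old x: 5 + 0 */
    s.x = 3;
    return s;
}
```

Both functions end with x equal to 3, but only the first matches the sequential program's value of y.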

Example of Data Independence

int x = 0;

int y = 5;

x = 3;

y = y + 5; //Is this line dependent on the previous line?

Why is Data Independence Useful?

Allows correct parallel execution

Code run in parallel, one line per processor: x = 3; and y = y + 5;

Result in either order: x = 3; y = 10;
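A quick way to check independence is to run the two statements in both orders and compare the results. The sketch below does that for this example; the names `Vars`, `x_first` and `y_first` are hypothetical.

```c
/* Hypothetical container for the two variables from the slide. */
typedef struct { int x; int y; } Vars;

/* Order A: x = 3 runs first, then y = y + 5. */
Vars x_first(void) {
    Vars v = {0, 5};       /* int x = 0; int y = 5; */
    v.x = 3;
    v.y = v.y + 5;
    return v;
}

/* Order B: y = y + 5 runs first, then x = 3. */
Vars y_first(void) {
    Vars v = {0, 5};
    v.y = v.y + 5;
    v.x = 3;
    return v;
}
```

Neither statement reads a value the other writes, so both orders give x = 3 and y = 10.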

Back to Array Sum Example

Does the code have data dependence?

int sum = 0;
for(int i = 0; i < size; i++) {
    sum += array[i]; //sum=sum+array[i];
}


Not so easy to see

Back to Array Sum Example

Let’s unroll the loop:

int sum = 0;
sum += array[0]; //sum=sum+array[0];
sum += array[1]; //sum=sum+array[1];
sum += array[2]; //sum=sum+array[2];
sum += array[3]; //sum=sum+array[3];
…

Now we can see dependence!


Removing Dependencies

Sometimes this is possible.

◦ Dependencies discussed in detail later.

Tip: Can be useful to look at the problem being solved by the code, and not the code itself.

Break Sum into Pieces

7 3 1 0 2 9 5 8 3 6



P0 P1
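Breaking the sum into pieces works because addition can be regrouped: the sum of the pieces equals the sum of the whole array. A minimal sketch, assuming the ten values are split into two equal halves (the exact split is not recoverable from the slide) and using a hypothetical helper `partial_sum`:

```c
/* Hypothetical helper: adds array[begin] up to, but not including, array[end]. */
int partial_sum(const int *array, int begin, int end) {
    int sum = 0;
    for (int i = begin; i < end; i++) {
        sum += array[i];
    }
    return sum;
}
```

With `int a[] = {7,3,1,0,2,9,5,8,3,6};`, P0 would compute `partial_sum(a, 0, 5)` and P1 would compute `partial_sum(a, 5, 10)`. The two calls touch different elements and write to different results, so there is no data dependence between them.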

Some Details…

A program executes inside a process

If we want to use multiple processors

◦ We need multiple processes

◦ One process for each processor (not fixed rule)

Processes are big, heavyweight

Threads are lighter than processes

◦ But same strategy

◦ One thread for each processor (not fixed rule)

We will talk about threads and processes later, if necessary

What Does the Code Look Like?

int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread; each entry must start at 0
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
    middleSum[threadID] += array[i];
}

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}

Load Balancing

Which processor is doing more work?

7 3 1 0 2 9 5 8 3 6



P0 P1

Load Balancing

[Figure: Part 0 and Part 1 run sequentially on P0, compared with Part 0 on P0 and Part 1 on P1 in parallel]





Example Problem: Array Sum

Parallelized code is more complex

Requires us to think differently about how to solve the problem

◦ Need to think about breaking it into parts

◦ Analyze data dependencies, remove if possible

◦ Need to load balance for better performance

Example Problem: Array Sum

However, the parallel code is broken

◦ Thread 0 adds all the middle sums.

◦ What if thread 0 finishes its own work, but other threads have not?

P0 will probably finish before P1

7 3 1 0 2 9 5 8 3 6



P0 P1

How Can We Fix The Code to GUARANTEE It Works Correctly?

int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread; each entry must start at 0
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
    middleSum[threadID] += array[i];
}

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}


Sometimes we need to coordinate/organize threads

If we don’t, the code might calculate the wrong answer to the problem

Can happen even if load balance is perfect

Synchronization is concerned with this coordination / organization

Code with Synchronization Fixed

int numThreads = 2; //Assume one thread per core, & 2 cores
int sum = 0;
int i = 0;
int middleSum[numThreads]; //One partial sum per thread; each entry must start at 0
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads

//Each thread will execute this code with a different threadID
for(i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++) {
    middleSum[threadID] += array[i];
}

waitForAllThreads(); //Wait for all threads

//Only thread 0 will execute this code
if (threadID==0) {
    for(i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}


The example shows a barrier

This is one type of synchronization

Barriers require all threads to reach that point in the code, before any thread is allowed to continue

It is like a gate. All threads come to the gate, and then it opens.
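One possible way to turn the barrier version into a runnable program is with POSIX threads, where `pthread_barrier_wait` plays the role of `waitForAllThreads()`. This is a sketch under assumptions, not the lecture's implementation: the names `WorkerArgs`, `worker` and `parallel_sum` are hypothetical, error checking is omitted, the accumulators are widened to `long long`, and `size` is assumed to be divisible by the number of threads.

```c
#define _POSIX_C_SOURCE 200809L   /* makes pthread barriers visible in strict modes */
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread arguments. */
typedef struct {
    int thread_id;
    int num_threads;
    const int *array;
    int size;
    long long *middle_sum;        /* one slot per thread, zero-initialized */
    long long *total;             /* written by thread 0 only */
    pthread_barrier_t *barrier;
} WorkerArgs;

static void *worker(void *p) {
    WorkerArgs *a = p;
    int set_size = a->size / a->num_threads;   /* assumes exact division */
    int begin = a->thread_id * set_size;
    int end = (a->thread_id + 1) * set_size;

    for (int i = begin; i < end; i++) {
        a->middle_sum[a->thread_id] += a->array[i];
    }

    /* Barrier: no thread continues until every thread has arrived. */
    pthread_barrier_wait(a->barrier);

    /* Only thread 0 combines the middle sums. */
    if (a->thread_id == 0) {
        long long sum = 0;
        for (int t = 0; t < a->num_threads; t++) {
            sum += a->middle_sum[t];
        }
        *a->total = sum;
    }
    return NULL;
}

long long parallel_sum(const int *array, int size, int num_threads) {
    pthread_t *threads = malloc(num_threads * sizeof *threads);
    WorkerArgs *args = malloc(num_threads * sizeof *args);
    long long *middle_sum = calloc(num_threads, sizeof *middle_sum);
    long long total = 0;
    pthread_barrier_t barrier;

    pthread_barrier_init(&barrier, NULL, (unsigned)num_threads);
    for (int t = 0; t < num_threads; t++) {
        args[t] = (WorkerArgs){t, num_threads, array, size,
                               middle_sum, &total, &barrier};
        pthread_create(&threads[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < num_threads; t++) {
        pthread_join(threads[t], NULL);   /* wait until thread t has finished */
    }
    pthread_barrier_destroy(&barrier);

    free(threads);
    free(args);
    free(middle_sum);
    return total;
}
```

On Linux this would typically be compiled with something like `gcc -pthread`. Without the `pthread_barrier_wait` call, thread 0 could read a middle sum that another thread is still updating.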

Generalizing the Solution

We only looked at how to parallelize for 2 threads

But the code is more general

◦ Can use any number of threads

◦ Important that code is written this way

◦ We will look at this in more detail later
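One detail to watch when using any number of threads: `size/numThreads` is integer division, so if `size` is not divisible by the number of threads, the last few elements would be left out. A possible way to handle this, sketched with a hypothetical helper `chunk_bounds`, is to give the first `size % numThreads` threads one extra element each:

```c
/* Hypothetical helper: computes the half-open range [*begin, *end)
 * that thread `thread_id` should add up. */
void chunk_bounds(int size, int num_threads, int thread_id,
                  int *begin, int *end) {
    int base = size / num_threads;    /* elements every thread gets */
    int extra = size % num_threads;   /* leftover elements to spread out */

    /* Threads before this one have already taken min(thread_id, extra) extras. */
    int extras_before = thread_id < extra ? thread_id : extra;

    *begin = thread_id * base + extras_before;
    *end = *begin + base + (thread_id < extra ? 1 : 0);
}
```

For example, 10 elements on 3 threads would be split into ranges of 4, 3 and 3 elements, which also keeps the load close to balanced.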

Parallel Program

Now the program is correct. Let’s look at performance.

[Figure: Time on 2-core Processor for 1 Thread, 2 Threads, and 4 Threads]


Two threads are not 2x as fast. Why?

◦ The problem is called false sharing

◦ To understand this, we have to look at the computer architecture

◦ We will study this in the next lecture

Four threads are slower than two threads.


◦ The processor only has two cores

◦ Four threads adds scheduling overhead, wastes
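False sharing is left for the next lecture, but as a rough preview: the entries of middleSum sit next to each other in memory, so threads updating different entries can end up contending for the same cache line. One common mitigation is to pad each entry so that no two entries share a line. The sketch below assumes 64-byte cache lines, which is typical on x86-64 but is an assumption here, and the names `CACHE_LINE_SIZE` and `PaddedSum` are hypothetical.

```c
#define CACHE_LINE_SIZE 64   /* assumption: check the actual hardware */

/* Hypothetical padded slot: one per thread. Each `value` is
 * CACHE_LINE_SIZE bytes away from the next thread's `value`. */
typedef struct {
    long long value;
    char pad[CACHE_LINE_SIZE - sizeof(long long)];
} PaddedSum;
```

A thread would then update `middleSum[threadID].value` instead of `middleSum[threadID]`. Another option is for each thread to add into a local variable and store it into middleSum once, after its loop.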



Used an example to start looking at how to parallelize code, and some of the main issues

◦ Data dependence

◦ Load balancing

◦ Synchronization

Each will be discussed in more detail in later lectures