Parallel Programming Course
Overview of parallel development with OpenMP
Paul Guermonprez
www.Intel-Software-Academic-Program.com
[email protected], Intel Software
2012-03-14
Computing Pi in serial
Pi serial code
double x, pi, sum = 0.0;
int i;
step = 1./(double)num_steps;
start = clock();
for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}
pi = sum*step;
stop = clock();
printf("The value of PI is %15.12f\n", pi);
printf("The time to calculate PI was %f seconds\n", ((double)(stop - start)/1000.0));
Compilation - MS
Open the Visual Studio 2010 Solution file:
C:\IAP_parallel_course\Pi example VS2010 solution\Pi.sln

Build (compile): F7
You should see "Build: 1 succeeded" at the bottom.

Open a VS command prompt: All Programs > MS VS2010 > VS Tools

cd /
cd IAP_parallel_course
cd "Pi example VS2010 solution"
cd Pi
cd Debug
dir

You should see Pi.exe.
Compilation - Linux
# 1. Go to the lab folder:
cd "Pi example Linux-CLI"

# 2. Compile with gcc:
gcc -O0 -g pi.c -o serial.bin
Serial execution - MS
Type Pi in the prompt to run the binary.

If you open the Task Manager in the Processes view and sort by CPU, Pi should be at the top at 50% (I also added the Threads column).

On Windows, 50% means the execution is using one core of my dual-core machine. It uses only one core because there is only one thread running.

On my machine, the runtime is 18 seconds.
Execution - Linux
# 1. Execute with time and note the "real" time:
time ./serial.bin

# 2. Run top in a different terminal:
top

# 3. Optional: check the libraries used by the binary:
ldd serial.bin

linux-gate.so.1 => (0x00ceb000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0x00711000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0x00f89000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0x004b4000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0x00119000)
/lib/ld-linux.so.2 (0x00c21000)
# Note: on Linux, if you fully use one core, CPU usage is reported as 100%.
Let's think parallel
Goal
Situation : The current version is using one core, taking 18 seconds to compute.
Goal : We would like to use two cores to go faster, if possible around 9 seconds.
Analysis : Our code is a long, CPU-intensive loop, so we would like something to split the iterations across different threads.
Solution : We'll try OpenMP.
OpenMP ?
OpenMP is an open standard : OpenMP.org

● You add magic "pragma" comments to your code. No need to change the C/C++.
● You compile with an OpenMP-aware compiler.
● Your binary will execute in parallel!

It's a simple, clean and well-known technology in both research and industry.
Pi with OpenMP pragmas
Serial version :

for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}

Parallel version with OpenMP pragmas (incomplete and buggy, we'll see why later) :

#pragma omp parallel for
for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}
Add the OpenMP pragma line to your code.
OpenMP pragma detail
#pragma omp parallel for

● # : it's a simple comment line; my C/C++ code is unchanged.
● pragma : this comment is special (compiler, please read this line).
● omp : read it as an OpenMP pragma (compiler, if you are not OpenMP-aware, do not read it).
● parallel : it's a parallel region.
● for : the kind of parallel region is a for loop.
Parallel version
Switch to Intel Compiler - MS
Most modern compilers support OpenMP(Intel Compiler, MS VS Compiler, GNU GCC, …).
But I've installed Intel Parallel Studio on top of MS Visual Studio, and I'd like to use it :
● Right click on the Pi project (under Solution Pi)
● Menu “Intel C++ Composer XE 2011 > Use Intel C++”
● ReBuild : F7
● You should see the message "Building with Intel(R) C++ Compiler XE 12.0"
Switch to Intel Compiler - Linux
# 1. Set the Intel environment variables:
source /opt/intel/bin/iccvars.sh intel64
# or
source /opt/intel/bin/iccvars.sh ia32

# 2. Check the icc version:
icc -v

# 3. Recompile with icc:
icc -O0 -g pi.c -o serial.bin

# 4. Or use make instead of icc to clean and build:
make
Enable OpenMP and ReBuild - MS
OpenMP is not enabled by default.
We need to change a compiler flag :
● Right click on the Project Pi
● Menu : “Configuration Properties >C/C++ > Language (Intel C++) > OpenMP Support”
● Switch from “No” to “Generate Parallel Code”
● If you are compiling from the command line, the flag is /Qopenmp on Windows or -openmp on Linux.
● OK to validate changes
● ReBuild : F7
Enable OpenMP and ReBuild - Linux
With icc, use the flag -openmp; with gcc, use -fopenmp.
Edit the included Makefile to change the line corresponding to the serial compilation from:

serial: pi.c
	${CC} -O0 ${CFLAGS} pi.c -o serial.bin

To:

serial: pi.c
	${CC} -openmp ${CFLAGS} pi.c -o serial.bin
Then :
make
Parallel execution - MS

Type Pi in the prompt to run the binary.

If you open the Task Manager in the Processes view and sort by CPU, Pi should be at the top at 100% instead of the 50% of the serial run, and running with 3 threads instead of 1 (on my dual-core machine). ;-)

100% means the execution is using all the cores on my machine. ;-)) I am using all the cores because I have enough threads running.

Unfortunately, the result is wrong, and it varies a lot from run to run. ;-(
Parallel execution - Linux
Type time ./serial.bin in the prompt to run the binary.

If you open top, the Pi binary should be at the top with nearly 200% instead of the 100% of the serial run. ;-)

200% means the execution is using the two cores on my machine. ;-)) I am using all the cores because I have enough threads running.
Note: on Windows, 100% means all cores; on Linux, 100% means one core.

Unfortunately, the result is wrong, and it varies a lot from run to run. ;-(
Analyze the problem
Source vs Runtime analysis
From the previous presentation, we are familiar with the concept of a race condition: the wrong Pi results should not be a surprise.
We introduced parallelism in our code without protecting the shared variables !
We could analyze the code, but today we'll try to use the Intel Parallel Inspector tool to characterize the parallel bug with a runtime analysis.
Runtime analysis
Parallel Inspector analyzes all memory accesses during runtime to detect potential race conditions, even ones that do not actually occur during that run.

Criterion #1 : run in parallel (check the task manager).

Criterion #2 : select a short benchmark, as an instrumented run is much slower than a regular run. Edit pi.c and set num_steps to 1000000.
Runtime analysis - MS
Click on “Tools>Intel Inspector XE 2011> Threading Error Analysis/Locate Deadlocks and Data Races” and wait.
Parallel Inspector Result - GUI
As suspected : I have a data race on the variables sum and x.
Runtime analysis - CLI
# 1. Set up the Intel Inspector environment variables:
source /opt/intel/inspector_xe_2011/inspxe-vars.sh

# 2. Run the "ti3" analysis on the serial.bin app:
inspxe-cl -collect ti3 -- ./serial.bin

The value of PI is 3.141164685199
The time to calculate PI was 1.050 seconds
Warning: One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application. Setting the Intel(R) Inspector XE 2011 to detect data races on stack accesses and running another analysis may help you locate these and other bugs.
1 new problem(s) found
1 Data race problem(s) detected
Parallel Inspector Result - CLI
# Run "report problems":
inspxe-cl -report problems

Problem P1: Error: Data race
pi.c(17): Error X1: P1: Data race: Write: Function main: Module serial.bin
pi.c(18): Error X2: P1: Data race: Read: Function main: Module serial.bin
pi.c(18): Error X3: P1: Data race: Write: Function main: Module serial.bin
pi.c(17): Error X4: P1: Data race: Write: Function main: Module serial.bin
pi.c(18): Error X5: P1: Data race: Write: Function main: Module serial.bin
Explanation
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
[Fork-join diagram: the master thread runs sequentially, forks worker threads #1-#4 at the parallel region, then joins back to sequential execution.]
x and sum are created before the #pragma line, during the serial part of the program.

But they are read and written during the parallel region by all threads.
Solve the problem
Solution planning
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
Two cases :

● x is only used locally during an iteration, to carry a value from the first line of the loop body to the second. We do not need to share x. Declaring x as a variable local to each iteration limits its scope and solves the data race.

● sum has to be shared, because we use the aggregated result after the parallel region to display the final value. We need to protect access to the variable.
OpenMP solution
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
Two solutions for two different problems :

● x : instead of changing the C++ code to declare x locally, I add private(x) to the pragma line.
● sum : I add reduction(+:sum) to the pragma line.
(To learn OpenMP in detail, check the next course)
Check the solution
First edit the file to add "private(x) reduction(+:sum)", then rebuild the binary (make, or the Rebuild Solution button)

and rerun the Inspector analysis: 0 problems detected!

Correctness : it works perfectly now, and it wasn't that complicated: only 1 line of pragmas!

Performance : for 1,000,000,000 steps on my dual-core computer,
the runtime is 19.98 s instead of 37.33 s (1.87x faster).
License Creative Commons - By 2.5
You are free:
● to Share — to copy, distribute and transmit the work
● to Remix — to adapt the work
● to make commercial use of the work

Under the following conditions:
● Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

With the understanding that:
● Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
● Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
● Other Rights — In no way are any of the following rights affected by the license:
  o Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
  o The author's moral rights;
  o Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
● Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the web page http://creativecommons.org/licenses/by/2.5/