Parallel Programming Course
Overview of parallel development with OpenMP
Paul Guermonprez
www.Intel-Software-Academic-Program.com
[email protected], Intel Software
2012-03-14
Computing Pi in serial
Pi serial code
double x, pi, sum = 0.0;
int i;
step = 1./(double)num_steps;
start = clock();
for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}
pi = sum*step;
stop = clock();
printf("The value of PI is %15.12f\n", pi);
printf("The time to calculate PI was %f seconds\n", ((double)(stop - start)/1000.0));
Compilation - MS
Open the Visual Studio 2010 Solution file:
C:\IAP_parallel_course\Pi example VS2010 solution\Pi.sln

Build (compile): F7
You should see "Build: 1 succeeded" at the bottom.

Open a VS command prompt: All Programs > MS VS2010 > VS Tools

cd /
cd IAP_parallel_course
cd "Pi example VS2010 solution"
cd Pi
cd Debug
dir

You should see Pi.exe.
Compilation - Linux
# 1. Go to the lab folder:
cd "Pi example Linux-CLI"

# 2. Compile with gcc:
gcc -O0 -g pi.c -o serial.bin
Serial execution - MS
Type Pi in the prompt to run the binary.

If you open the Task Manager in the Processes view and sort by CPU, Pi should be at the top at 50% (I also added the Threads column).

On Windows, 50% means the execution is using one core of my dual-core machine. It uses only one core because there is only one thread running.

On my machine, the runtime is 18 seconds.
Execution - Linux
# 1. Execute with time and note the "real" time:
time ./serial.bin

# 2. Run top in a different terminal:
top

# 3. Optional: check the libraries used by the binary:
ldd serial.bin

linux-gate.so.1 => (0x00ceb000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0x00711000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0x00f89000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0x004b4000)
libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0x00119000)
/lib/ld-linux.so.2 (0x00c21000)
# Note: on Linux, if you fully use one core, CPU usage is reported as 100%.
Let's think parallel
Goal
Situation : The current version is using one core, taking 18 seconds to compute.
Goal : We would like to use two cores to go faster, if possible around 9 seconds.
Analysis : Our code is a long, CPU-intensive loop, so we would like something to split the iterations across different threads.
Solution : We'll try OpenMP.
OpenMP ?
OpenMP is an open standard : OpenMP.org

● You add magic "pragma" comments to your code. No need to change the C/C++.
● You compile with an OpenMP-aware compiler.
● Your binary will execute in parallel!

It's a simple, clean and well-known technology in both research and industry.
Pi with OpenMP pragmas
Serial version :

for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}

Parallel version with OpenMP pragmas (incomplete and buggy, we'll see why later) :

#pragma omp parallel for
for (i = 0; i < num_steps; i++) {
    x = (i + .5)*step;
    sum = sum + 4.0/(1. + x*x);
}
Add the OpenMP pragma line to your code.
OpenMP pragma detail
#pragma omp parallel for

● # : it's a simple comment line; my C/C++ code is unchanged.
● pragma : this comment is special (compiler, please read this line).
● omp : read it as an OpenMP pragma (compiler, if you are not OpenMP-aware, do not read it).
● parallel : it's a parallel region.
● for : the kind of parallel region is a for loop.
Parallel version
Switch to Intel Compiler - MS
Most modern compilers support OpenMP(Intel Compiler, MS VS Compiler, GNU GCC, …).
But I've installed Intel Parallel Studio on top of MS Visual Studio, and I'd like to use it :
● Right click on the Pi project (under Solution Pi)
● Menu “Intel C++ Composer XE 2011 > Use Intel C++”
● ReBuild : F7
● You should see the message "Building with Intel(R) C++ Compiler XE 12.0"
Switch to Intel Compiler - Linux
# 1. Set the Intel environment variables:
source /opt/intel/bin/iccvars.sh intel64
# or
source /opt/intel/bin/iccvars.sh ia32

# 2. Check the icc version:
icc -v

# 3. Recompile with icc:
icc -O0 -g pi.c -o serial.bin

# 4. Or use make instead of icc to clean and build:
make
Enable OpenMP and ReBuild - MS
OpenMP is not enabled by default.
We need to change a compiler flag :
● Right click on the Project Pi
● Menu : “Configuration Properties >C/C++ > Language (Intel C++) > OpenMP Support”
● Switch from “No” to “Generate Parallel Code”
● If you are compiling from the command line, the flag is /Qopenmp on Windows or -openmp on Linux.
● OK to validate changes
● ReBuild : F7
Enable OpenMP and ReBuild - Linux
With icc, use the flag -openmp; with gcc, use -fopenmp.
Edit the included Makefile to change the line corresponding to the serial compilation from:

serial: pi.c
	${CC} -O0 ${CFLAGS} pi.c -o serial.bin

To:

serial: pi.c
	${CC} -openmp ${CFLAGS} pi.c -o serial.bin
Then :
make
Parallel execution - MS

Type Pi in the prompt to run the binary.

If you open the Task Manager in the Processes view and sort by CPU, Pi should be at the top at 100% instead of the 50% of the serial run, and running with 3 threads instead of 1 (on my dual-core machine). ;-)

100% means the execution is using all the cores on my machine. ;-)) I am using all the cores because I have enough threads running.

Unfortunately, the result is wrong, and it varies a lot from run to run. ;-(
Parallel execution - Linux
Type time ./serial.bin in the prompt to run the binary.

If you open top, the Pi binary should be at the top with nearly 200% instead of the 100% of the serial run. ;-)

200% means the execution is using the two cores on my machine. ;-)) I am using all the cores because I have enough threads running.
Note: on Windows, 100% means all cores; on Linux, 100% means one core.

Unfortunately, the result is wrong, and it varies a lot from run to run. ;-(
Analyze the problem
Source vs Runtime analysis
From the previous presentation, we are familiar with the concept of a race condition: the wrong Pi results should not be a surprise.
We introduced parallelism in our code without protecting the shared variables !
We could analyze the code, but today we'll try to use the Intel Parallel Inspector tool to characterize the parallel bug with a runtime analysis.
Runtime analysis
Parallel Inspector analyzes all memory accesses during runtime to detect potential race conditions, even ones that do not actually occur during that run.

Criterion #1 : run in parallel (check the task manager).

Criterion #2 : select a short benchmark, as an instrumented run is much slower than a regular run. Edit pi.c and set num_steps to 1000000.
Runtime analysis - MS
Click on “Tools>Intel Inspector XE 2011> Threading Error Analysis/Locate Deadlocks and Data Races” and wait.
Parallel Inspector Result - GUI
As suspected : I have a data race on the variables sum and x.
Runtime analysis - CLI
# 1. Set up the Intel Inspector environment variables:
source /opt/intel/inspector_xe_2011/inspxe-vars.sh

# 2. Run the "ti3" analysis on the serial.bin app:
inspxe-cl -collect ti3 -- ./serial.bin

The value of PI is 3.141164685199
The time to calculate PI was 1.050 seconds
Warning: One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application. Setting the Intel(R) Inspector XE 2011 to detect data races on stack accesses and running another analysis may help you locate these and other bugs.
1 new problem(s) found
1 Data race problem(s) detected
Parallel Inspector Result - CLI
# Run "report problems":
inspxe-cl -report problems

Problem P1: Error: Data race
pi.c(17): Error X1: P1: Data race: Write: Function main: Module serial.bin
pi.c(18): Error X2: P1: Data race: Read: Function main: Module serial.bin
pi.c(18): Error X3: P1: Data race: Write: Function main: Module serial.bin
pi.c(17): Error X4: P1: Data race: Write: Function main: Module serial.bin
pi.c(18): Error X5: P1: Data race: Write: Function main: Module serial.bin
Explanation
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
[Fork-join diagram: the master thread runs sequentially, forks worker threads #1-#4 at the parallel region, then joins back to sequential execution.]
x and sum are created before the #pragma line, during the serial part of the program.

But they are read and written during the parallel region by all threads.
Solve the problem
Solution planning
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
Two cases :

● x is only used locally during an iteration, to carry a value from the first line of the loop body to the second. We do not need to share x. Declaring x as a variable local to each iteration limits its scope and solves the data race.

● sum has to be shared, because we use the aggregated result after the parallel region to display the final value. We need to protect access to the variable.
OpenMP solution
double x, pi, sum = 0.0;          /* sequential mode, master thread */
#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0; i < num_steps; i++)   /* beginning of the parallel region */
{
    x = (i + .5)*step;            /* executed in parallel in each worker thread */
    sum = sum + 4.0/(1. + x*x);   /* executed in parallel in each worker thread */
}                                 /* end of the parallel region */
Two solutions for two different problems :

● x : instead of changing the C++ code to declare x locally, I add private(x) to the pragma line.
● sum : I add reduction(+:sum) to the pragma line.
(To learn OpenMP in detail, check the next course)
Check the solution
First edit the file to add "private(x) reduction(+:sum)", then rebuild the binary (make, or the Rebuild Solution button)

and rerun the Inspector analysis: 0 problems detected!

Correctness : it works perfectly now, and it wasn't that complicated: only 1 line of pragmas!

Performance : for 1,000,000,000 steps on my dual-core computer,
the runtime is 19.98 s instead of 37.33 s (1.87x faster).
License Creative Commons - By 2.5
You are free:
● to Share — to copy, distribute and transmit the work
● to Remix — to adapt the work
● to make commercial use of the work

Under the following conditions:
● Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

With the understanding that:
● Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
● Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
● Other Rights — In no way are any of the following rights affected by the license:
  o Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
  o The author's moral rights;
  o Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
● Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the web page http://creativecommons.org/licenses/by/2.5/