Slide 1: Performance of Parallel Programs
Programming Practice, Fall 2011
Bernd Burgstaller, Yonsei University
Bernhard Scholz, The University of Sydney
Slide 2: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling
Slide 3: Limits to Performance Scalability
Not all programs are “embarrassingly” parallel.
Programs have sequential parts and parallel parts.
Example: color -> grayscale conversion (Assignment 2)
Sequential part 1: read the input data from file.
Parallel part: perform the grayscale conversion.
Sequential part 2: write the converted data to the output file.
1 a = b + c;
2 d = a + 1;
3 e = d + a;
4 for (k=0; k < e; k++)
5   M[k] = 1;
Sequential part: cannot be parallelized because of data dependencies.
Parallel part: no data dependence, the different loop iterations (Line 5) can be executed in parallel.
Slide 4: Amdahl's Law
In 1967, Gene Amdahl, then a mainframe computer architect at IBM, stated that
“The performance improvement to be gained from some faster mode of execution is limited by the fraction of the time that the faster mode can be used.”
“Faster mode of execution” here means program parallelization.
The potential speedup is defined by the fraction of the code that can be parallelized.
[Diagram: a program consisting of a 25-second sequential part, a 50-second parallelizable part, and another 25-second sequential part: 25 + 50 + 25 = 100 seconds in total. Using 5 threads (T1-T5) for the parallelizable part: 25 + 10 + 25 = 60 seconds in total.]
Slide 5: Amdahl's Law (cont.)
Speedup = old running time / new running time = 100 seconds / 60 seconds ≈ 1.67
The parallel version is 1.67 times faster than the sequential version.
Slide 6: Amdahl's Law (cont.)
We may use more threads executing in parallel for the parallelizable part, but the sequential part will remain the same.
The sequential part of a program limits the speedup that we can achieve!
Even if we theoretically could reduce the parallelizable part to 0 seconds, the best possible speedup in this example would be
speedup = 100 seconds / 50 seconds = 2.
[Diagram: sequentially, 25 + 50 + 25 = 100 seconds. Even if the parallelizable part shrank from 50 to 2 seconds, the total would still be 25 + 2 + 25 = 52 seconds.]
Slide 7: Amdahl's Law (cont.)
p = fraction of work that can be parallelized.
N = the number of threads executing in parallel.

new_running_time = (1 - p) * old_running_time + (p * old_running_time) / N

Speedup = old_running_time / new_running_time = 1 / ((1 - p) + p/N)
Observation: if the number of threads goes to infinity (N -> ∞), the speedup becomes 1 / (1 - p).
Parallel programming pays off for programs which have a large parallelizable part.
Slide 8: Amdahl's Law (Examples)
p = fraction of work that can be parallelized; N = the number of threads executing in parallel.

Speedup = old_running_time / new_running_time = 1 / ((1 - p) + p/N)
Example 1: p = 0, an embarrassingly sequential program: speedup = 1, no matter how large N is.
Example 2: p = 1, an embarrassingly parallel program: speedup = 1 / (1/N) = N.
Example 3: p = 0.75, N = 8: speedup = 1 / (0.25 + 0.75/8) ≈ 2.91.
Example 4: p = 0.75, N = ∞: speedup = 1 / 0.25 = 4.
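To make the formula concrete, here is a minimal C sketch (ours, not from the slides) that evaluates Amdahl's speedup for the examples above; the function name amdahl_speedup is an assumption:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - p) + p/N) */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("p=0.00, N=8:   %.2f\n", amdahl_speedup(0.0, 8));    /* 1.00, for any N */
    printf("p=1.00, N=8:   %.2f\n", amdahl_speedup(1.0, 8));    /* 8.00 = N */
    printf("p=0.75, N=8:   %.2f\n", amdahl_speedup(0.75, 8));   /* 2.91 */
    printf("p=0.75, N=1e9: %.2f\n", amdahl_speedup(0.75, 1e9)); /* -> 4.00 as N grows */
    return 0;
}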
Slide 9: Speedup according to Amdahl's Law
[Chart: speedup factor (0 to 100) versus number of cores (1 to 16384) for programs that are 80%, 90%, 95% and 99% parallel; each curve levels off at its limit of 1/(1-p).]
Slide 10: Gustafson's Law
In 1988, John L. Gustafson reported the speedup achieved in an experiment as the scaled speedup: speedup = np + N * p = N - (N - 1) * np, where np and p (np + p = 1) are the sequential and parallel fractions of the running time measured on the N-processor system.
This seemed to contradict Amdahl's law:
With Amdahl's law, no matter how many cores we add, the sequential part of the application imposes an upper bound on the possible speedup (see previous slide).
With Gustafson's law, unlimited parallelism benefits the application.
Example: N = 256, p = 99%, np = 1%: speedup = 0.01 + 256 * 0.99 = 253.45.
Slide 11: Gustafson's Law (cont.)
Sequential portions differ between Amdahl's and Gustafson's Laws:
Amdahl's Law uses a fixed fraction (1 - p) of the sequential running time for the sequential portion.
That way, the sequential portion does not scale with higher N.
Gustafson's Law scales down the sequential part (np) with higher N: np is the sequential fraction of the parallel running time T(N), and its share of the total work shrinks as N grows.
Assume: T(N) = ts + tp, i.e., the run on N cores consists of a sequential part ts and a parallelized part tp, so the same program on one core takes T(1) = ts + N * tp.
Amdahl's and Gustafson's Laws can be proven equivalent if we factor in the differences w.r.t. the scalability of the sequential portions of the code.
Slide 12: Gustafson's Law (cont.)
Example from Slide 10: N = 256, T(256) = 100s, np(256) = 1%, p(256) = 99%.
T(1) = ts + N * tp = 1 + 256 * 99 = 25345s.
Amdahl's sequential fraction: ts(1)/T(1) = 1/25345 ≈ 0.0000395, i.e., 0.00395% (compare with np = 1% on Slide 10!).
Amdahl's speedup with p = 1 - 1/25345 and N = 256: 1 / ((1 - p) + p/256) ≈ 253.45 (compare with Gustafson's speedup on the previous slide!).
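A small C sketch (ours, not from the slides) that reproduces this calculation and shows that both laws yield the same speedup once the sequential portion is accounted for consistently:

#include <stdio.h>

int main(void) {
    double N = 256.0, ts = 1.0, tp_each = 99.0; /* numbers from Slide 12 */
    double T_N = ts + tp_each;                  /* 100s on 256 cores */
    double T_1 = ts + N * tp_each;              /* 25345s on 1 core */

    /* Gustafson: np measured on the parallel system */
    double np = ts / T_N;                       /* 0.01 */
    double gustafson = np + N * (1.0 - np);     /* 253.45 */

    /* Amdahl: p measured on the sequential system */
    double p = (N * tp_each) / T_1;             /* 0.9999605... */
    double amdahl = 1.0 / ((1.0 - p) + p / N);  /* 253.45 */

    printf("Gustafson: %.2f, Amdahl: %.2f, T(1)/T(N): %.2f\n",
           gustafson, amdahl, T_1 / T_N);
    return 0;
}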
Slide 13: Speedup according to Gustafson's Law
[Chart: speedup factor (0 to 300) versus number of cores (1 to 256) for programs that are 50%, 80%, 95% and 99% parallel; the curves keep growing linearly with the number of cores.]
Slide 14: Amdahl's and Gustafson's Laws
Fundamental question: given an application, is the speedup potential bounded (as implied by Amdahl’s Law), or unbounded (as implied by Gustafson’s law)?
Answer lies in the sequential part of the application:
If the share of the sequential part stays constant as the number of available cores increases -> Amdahl's Law: the application has an upper bound on its possible speedup.
• Example: an application consisting of grayscale conversion (parallel part) and file I/O (sequential part). If file I/O does not become faster, then the speedup is capped (limited) per Amdahl's Law.
If the share of the sequential part diminishes as more cores are added -> Gustafson's Law.
• Example: the same application; if file I/O becomes faster as we add more cores (e.g., because the I/O itself was improved/parallelized), then Gustafson's Law applies.
In both cases, the sequential portion of the application needs to be reduced to achieve higher speedups.
Slide 15: Performance Scalability
On the previous slides, we assumed that performance directly scales with the number N of threads/cores used for the parallelizable part p.
E.g., if we double the number of threads/cores, we expect the execution time to drop to half. This is called linear speedup.
However, linear speedup is an “ideal case” that is often not achievable.
[Graph: speedup versus number of threads/cores, compared to the sequential program; the curves level off as cores are added.]
The speedup graph shows that the curves level off as we increase the number of cores. This is a result of keeping the problem size constant: the amount of work per thread decreases, so the overhead for thread creation etc. becomes more significant.
Other reasons possible also (memory hierarchy, highly-contended lock, ...).
An efficiency (speedup divided by N) of 1 indicates linear speedup; efficiency > 1 indicates super-linear speedup.
Super-linear speedups are possible because multiple cores bring more registers and more total cache capacity.
Slide 16: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling
Slide 17: Granularity of parallelism = frequency of interaction between threads
Fine-grained Parallelism
Small amount of computational work between communication / synchronization stages.
Low computation to communication ratio.
High communication / synchronization overhead.
Less opportunity for performance improvement.
Coarse-grained Parallelism
Large amount of computational work between communication / synchronization stages.
High computation to communication ratio.
Low communication / synchronization overhead.
More opportunity for performance improvement.
Harder to load-balance effectively.
Slide 18: Load Balancing Problem
Threads that finish early have to wait for the thread with the largest amount of work to complete.
This leads to idle times and lowers processor utilization.
[Diagram: threads perform unequal amounts of work between two synchronization points; threads that finish early sit idle until the slowest thread reaches the synchronization.]
Slide 19: Load Balancing
Load imbalance: work not evenly assigned to cores/SPEs. Underutilizes parallelism!
The assignment of work, not data, is key.
Static assignment = assignment at program-writing time; it cannot be changed at run-time.
More prone to imbalance.
Dynamic assignment = assignment at run-time. Quantum of work must be large enough to amortize overhead.
Example: threads fetch work from a work queue (see the diagram and sketch below).
With flexible allocations, load balancing can be addressed late in the design/programming cycle.
[Diagram: threads T1 and T2 repeatedly fetch items from a shared work queue (steps 1-3); whichever thread finishes its item first grabs the next one.]
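A minimal pthreads sketch (ours, not from the slides) of dynamic assignment via a shared work queue; the names worker, next_item and TOTAL_ITEMS are assumptions:

#include <pthread.h>
#include <stdio.h>

#define TOTAL_ITEMS 1000

static int next_item = 0;  /* index of the next unclaimed work item */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void process(int item) { (void)item; /* the actual work would go here */ }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);   /* fetch one work item */
        int item = next_item < TOTAL_ITEMS ? next_item++ : -1;
        pthread_mutex_unlock(&queue_lock);
        if (item < 0) break;               /* queue empty: done */
        process(item);                     /* work happens outside the lock */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("processed %d items\n", next_item);
    return 0;
}

The quantum of work per fetch must be large enough to amortize the locking overhead; grabbing several items per fetch coarsens the granularity.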
Slide 20: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling
Slide 21: Let's consider a simple problem
Sum up the numbers in array[], which holds length values.
Sequential solution:
int array[length];
sum = 0;
for (i = 0; i < length; i++) {
    sum = sum + array[i];
}
Slide 22: Write a parallel program
We need to know something about the machine...
Let’s use a multicore architecture!
[Diagram: multicore memory hierarchy - RAM (main memory), a shared L2 cache, and a private L1 cache for each of Core 0 and Core 1.]
Main memory access is extremely slow compared to the processing speed of today’s CPUs.
A cache is a small but fast memory that is used to store copies of data from the main memory.
Holds copies of data from the most recently used locations in main memory.
Memory hierarchy:
• registers -> L1 cache -> L2 cache -> main memory.
Cache miss: CPU wants to access data which is not in the cache.
• Need to access data in level below, eventually in main memory (slow!).
Slide 23: Divide into several parts...
Solution for several threads: divide the array into several parts.
array = { 1, 4, 3, 12 | 1, 42, 16, 17 | 9, 45, 33, 12 | 18, 4, 7, 22 }, length = 16, t = 4.
Thread 0 takes the first four elements, Thread 1 the next four, and so on up to Thread 3.
int length_per_thread = length / t;
int start = id * length_per_thread;
sum = 0;
for (i = start; i < start + length_per_thread; i++) {
    sum = sum + array[i];
}

What's wrong?
Slide 24: Divide into several parts...
int length_per_thread = length / t;
int start = id * length_per_thread;
sum = 0;
pthread_mutex_t m;
for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}
sum is a shared variable. We need to ensure mutual exclusion when accessing sum from different threads.
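For reference, a complete, runnable packaging of this fragment (our own; the slides show only the thread function body). Note that the mutex must be shared by all threads and initialized, which the fragment glosses over:

#include <pthread.h>
#include <stdio.h>

#define LENGTH 16
#define T 4

static int array[LENGTH] = {1, 4, 3, 12, 1, 42, 16, 17,
                            9, 45, 33, 12, 18, 4, 7, 22};
static long sum = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; /* shared, initialized */

static void *partial_sum(void *arg) {
    int id = *(int *)arg;
    int length_per_thread = LENGTH / T;
    int start = id * length_per_thread;
    for (int i = start; i < start + length_per_thread; i++) {
        pthread_mutex_lock(&m);   /* protect the shared sum */
        sum = sum + array[i];
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[T];
    int ids[T];
    for (int i = 0; i < T; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, partial_sum, &ids[i]);
    }
    for (int i = 0; i < T; i++) pthread_join(threads[i], NULL);
    printf("sum = %ld\n", sum);   /* expected: 246 */
    return 0;
}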
Slide 25: The correct program is slow...
We are mainly acquiring/releasing the mutex: lock()/unlock() induces a huge overhead!
Threads serialize at the mutex and have to wait for each other.
Memory traffic for the sum and mutex variables.
[Bar chart, for an array of 16 million unsigned chars: sequential version 62, mutex-protected parallel versions 925 and 3100.]
Slide 26: Closer look: motion of sum and m
The variables sum and m need to be moved across the memory hierarchy because both threads access them.
Takes time.
Lock contention at mutex m.
Overhead for lock/unlock operations.
int length_per_thread = length / t;
int start = id * length_per_thread;
unsigned long long sum = 0;
pthread_mutex_t m;
for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}

[Diagram: memory hierarchy as on Slide 22; sum and m migrate between the L1 caches of Core 0 and Core 1.]
Slide 27: Accumulate into private sum variable
Each thread adds into its own memory (variable private_sum[]).
Combine sums at the end.
#define t 4
int length_per_thread = length / t;
int start = id * length_per_thread;
unsigned long long private_sum[t] = {0};
unsigned long long sum = 0;
pthread_mutex_t m;

for (i = start; i < start + length_per_thread; i++) {
    private_sum[id] = private_sum[id] + array[i];
}

pthread_mutex_lock(&m);
sum = sum + private_sum[id];
pthread_mutex_unlock(&m);
Slide 28: Keeping up, but not gaining
The 1-thread version is almost as good as the sequential one,
apart from overhead for creating threads
It gets worse with 2 or more threads. With the 9th thread, we introduce a new cache line and things get slightly better.
However, still not faster than the sequential version!
[Chart: runtimes for summing an array of 16 million unsigned chars, sequential versus increasing thread counts with private_sum[].]
Slide 29: False sharing
A private sum variable does not imply a private cache line.
[Diagram: private_sum[0] and private_sum[1] share a single cache line; that line is replicated into the L1 caches of both Core 0 and Core 1 and bounces between them.]
Slide 30: False sharing (zoomed in)
The CPU cannot load individual variables into the cache;
it can only load in units of cache lines (usually 64 bytes).
Two thread-private variables in the same cache line are as bad as a single shared variable:
the cache line ping-pongs between the caches when the two threads simultaneously access their private counters.
[Diagram: the cache line holding both threads' counters is fetched from main memory into the L2 cache and then bounces between the two cores' L1 caches.]
Slide 31: False sharing (solution)
By padding memory, we can force each private counter into a separate cache line:
the cache line with a thread's private counter is loaded into that core's L1 cache;
there is no need to exchange the other counter's cache line between the L1 caches.
[Diagram: after padding, priv_sum[0] and priv_sum[1] occupy separate cache lines; each core keeps only its own thread's line in its L1 cache.]
Slide 32: Force into different cache lines
• Padding the private sum variables forces them into separate cache lines and removes false sharing.
• Two threads now almost twice as fast as one:
struct padded_int {
    unsigned long long value;   // 8 bytes wide
    char padding[56];           // assuming a cache line is 64 bytes
} private_sum[MaxThreads];
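A sketch of how the summation loop from Slide 27 changes with the padded array (our own wording of the fix; declarations of start, length_per_thread, m and sum as in the earlier fragments):

for (i = start; i < start + length_per_thread; i++) {
    private_sum[id].value += array[i];  /* each thread writes its own cache line */
}
pthread_mutex_lock(&m);
sum = sum + private_sum[id].value;      /* combine the partial sums once, at the end */
pthread_mutex_unlock(&m);

With C11, an alternative to manual padding is to align each array element with _Alignas(64).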
[Bar chart, for an array of 16 million unsigned chars: sequential 62, 1 thread 77, 2 threads 39, higher thread counts 22, 16, 16.]
Slide 33: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss (next)
Performance Measurement & Profiling
Slide 34: The 4 Sources of Performance Loss
Linear speedup with parallel processing is frequently not attainable, for the following reasons:
1) Overhead which sequential computation does not have to pay
2) Non-parallelizable computation
3) Idle processors
4) Contention for resources
All other sources are special cases of these four!
In next slides, we’ll consider those four sources of performance loss.
From the textbook, Chapter 3.
Slide 35: Example: The 4 Sources of Performance Loss
Task: Let’s assume we want to paint the four walls of our class-room.
A single painter will take 1 day to paint all four walls, and then it takes 1 day for the paint to dry.
Adding more painters will speed up the job, but we'll face the 4 sources of performance loss:
1) Overhead which sequential computation does not have to pay:
we need to decide which painters will paint which parts (walls) of the class-room
2) Non-parallelizable computation:
after the painters finish, it’ll take 1 day for the paint to dry
3) Idle processors:
if we use 1M painters, they have to queue up to enter the class-room and do their paint-job
4) Contention for resources
Assume we only have one bucket of paint shared by all participating painters...
Slide 36: Loss 1: Overhead
Any cost incurred by the parallel solution but not required by the sequential solution is called overhead.
Examples: creating/joining threads, synchronization, ...
Overhead 1: Communication
Because the sequential solution does not have to communicate with other threads, all communication is overhead.
• Example: private sums have to be communicated back to the main thread.
Cost for communication depends on hardware
• Examples: shared-memory communication cheaper than sending messages over network, non-uniform memory access (NUMA) architectures.
Overhead 2: Synchronization
Arises when one thread must wait for an event on another thread.
Example: a thread must wait for another thread to free a resource (e.g., a mutex) in the sum computation example. The sequential solution does not require synchronization!
Slide 37: Loss 1: Overhead (cont.)
Overhead 3: Computation
Parallel computations almost always perform extra computations not needed in sequential solution.
Examples:
• Threads in grayscale conversion need to compute the part of the array on which they should work.
• Initialization of counters etc. is executed in each of the parallel threads.
Overhead 4: Memory
Parallel computations frequently require more memory than sequential computations.
• Example: padding in the parallel sum computation.
Padding is a small memory overhead that actually improves performance.
Applications with significant memory overhead will experience performance loss.
Slide 38: Loss 2: Non-parallelizable computations
Non-parallelizable computations limit potential benefit from parallelism.
Amdahl's law observes that if 1/S of a computation is inherently sequential, the maximum speedup is limited to a factor of S.
Special case: redundant computations
N threads execute loop k times:
for(i=0; i<k; i++) {...}
all N threads increment their private loop variable, test loop condition, ...
Idle threads waiting for one thread to perform computation
Example: disk I/O
Slide 39: Loss 3: Idle Times
Ideally, all processors/cores are working all the time.
Thread might become idle because
1) of lack of work
2) it is waiting for an external event, such as the arrival of data from another thread
• Example: producer-consumer relationship
3) Load Imbalance
• uneven distribution of work to processors/cores
• Example: sequential computation running on single core, all other cores idle
4) Memory-bound computations
• Main memory very slow compared to CPU
• Caches hide memory-latency, but applications with poor locality or large data-sets spend 100s of CPU-cycles waiting for data from main memory.
Slide 40: Loss 4: Contention for Resources
Contention is degradation of system performance caused by competition for a shared resource.
Example: highly contended lock.
Contention can degrade performance beyond the overhead observed from a 1-thread solution.
That means: 2 threads can be even slower than 1 thread.
Example: both true sharing and false sharing create excessive load on the memory bus (bus traffic)
Affects even threads that access other memory locations (and hence do not contend for shared data)!
Slide 41: Performance Trade-offs
We have seen 4 sources of performance loss.
Limiting one source of performance loss may increase another
Important to understand the trade-offs between different sources of performance loss.
Trade-off 1: Communication versus Computation
Often possible to reduce communication overhead by performing additional computations.
Redundant computations
• Re-compute value locally (within a thread) instead of transmitting from other thread
• works if cost (re-computation) < cost (transmission)
• Example: pseudo-random numbers
Overlapping Communication and Computation
• Communication that is independent of computation can be performed while the computation is going on (hiding communication behind computation).
• Makes the program more complicated (DMA transfers, wait-for-completion, ...).
• Examples: in our lecture on OpenCL programming.
Slide 42: Performance Trade-offs (cont.)
Trade-off 2: Memory versus Parallelism
Parallelism can often be increased at the cost of increased memory usage:
Privatization (e.g., private_sum variable)
Padding
Trade-off 3: Overhead versus Parallelism
Parallelize Overhead
• The final accumulation of private_sums (Slide 27) can become a bottleneck with a large number of threads -> parallelize the overhead itself!
• We will discuss this in a subsequent lecture.
Load-Balance versus Overhead
• Increased parallelism can improve load balance
• by over-decomposing the problem (= overhead) into fine-grained work units.
Granularity Trade-offs
• Increase the granularity of interaction.
• Examples:
1) each thread grabs several work items at once from the work queue
2) send a whole row/column of a matrix instead of individual elements
Slide 43: Summary: Common Performance Bottlenecks
Huge sequential part of a program.
Amdahl’s Law tells us that we cannot gain much in such a case.
Highly contended lock
All threads need to queue up to enter their critical section.
High communication overhead, little computation.
Remember: computation is our goal! Communication and synchronization are just overhead due to multiple threads.
Sharing of variables (mutexes and semaphores are also variables!)
• rand() uses a (shared!) seed variable -> use rand_r() instead; see the sketch at the end of this slide.
False sharing of variables in a cache line.
Poor load balancing.
Some threads do all the work (are overworked!), other threads are idle. Processor resources are not well-utilized.
Scalar code
Use vector units (SIMDization of scalar code) as much as possible.
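A minimal sketch of the rand_r() point above (our own example): each thread keeps a private seed, so no hidden shared state is touched.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *worker(void *arg) {
    /* private seed, derived from the thread id passed in arg */
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(size_t)arg;
    long local = 0;
    for (int i = 0; i < 1000000; i++)
        local += rand_r(&seed) % 100;   /* rand_r touches only our own seed */
    printf("thread %zu: %ld\n", (size_t)arg, local);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (size_t i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (size_t i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}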
Slide 44: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling (next)
With special thanks to our interns Hyewon, Changwook and Jongmyeong!
• For their kind exploration and lecture materials on Intel VTune
Slide 45: Measuring Performance
Execution time ... what’s time?
We can measure the time from start to termination of our program.
Also called wallclock time.
gcc foo.c
time ./a.out
real    0m4.159s    (elapsed wallclock time)
user    0m0.008s    (time executed in user mode)
sys     0m0.026s    (time executed in kernel mode; see next slide)
Slide 46: Difference between user and system time
A program requests services from the operating system (file I/O, dynamic memory, ...) via calls to the operating system kernel. These are called system calls.
The time spent in system calls is counted as system time (marked in the code example below).
The rest of the code is executed outside of the operating system kernel and is counted as user time (also marked below).
Question: why is user time + system time ≠ wallclock time?
[Diagram: the Linux operating system kernel schedules the threads of programs 1, 2 and 3 onto Core 1 and Core 2.]
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i, *j, k;
    FILE *f = fopen(...);           /* system call: counted as system time */
    i = 255;
    j = malloc(255 * sizeof(int));  /* may enter the kernel: system time */
    for (k = 0; k < i; k++)
        *(j + k) = 0;               /* ordinary code: user time */
    ...
}
Slide 47: Wanted: an unloaded machine
Principal problem with wallclock time: different programs share the available CPUs/cores.
The Linux kernel schedules the threads of all programs on those CPUs/cores.
Every thread gets to run for a little while (its timeslice), then another thread gets to run.
Programs get in the way of each other. If we are interested in the execution time of program 1, the measured wallclock time also contains time spent executing programs 2 and 3.
[Diagram: the Linux kernel time-slices threads T1-T3 (program 1), T4-T5 (program 2) and T6-T8 (program 3) onto Core 1 and Core 2; between clear/start and the termination of program 1, both cores also run threads of programs 2 and 3.]
Slide 48: Measuring Performance
On an unloaded machine, wallclock time gives us a crude program performance indication.
Run the program several times
Difference between cold start and warm start
• Linux file system caches
• instruction caches, data caches
Wallclock time does not tell us in which parts of the program we spend time.
gettimeofday() allows us to measure wallclock time for specific parts of the program (see the sketch below).
We will study more elaborate performance measurement methods in the following slides.
int main(void) {
    float s1 = 0, s2 = 0;
    int i, j;
    /* record start-time here */
    for (j = 0; j < T; j++) {
        for (i = 0; i < N; i++) {
            b[i] = 0;
        }
        for (i = 0; i < N; i++) {
            s1 += a[i] + b[i];
            s2 += a[i] * a[i] * b[i];
        }
    }
    /* record stop-time here */
}
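A sketch of such a measurement with gettimeofday() (our own helper; the name elapsed_seconds is not from the slides):

#include <stdio.h>
#include <sys/time.h>

/* wallclock difference in seconds between two gettimeofday() samples */
static double elapsed_seconds(struct timeval start, struct timeval stop) {
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_usec - start.tv_usec) / 1e6;
}

int main(void) {
    struct timeval start, stop;
    gettimeofday(&start, NULL);    /* record start-time */
    /* ... the code region to be measured ... */
    gettimeofday(&stop, NULL);     /* record stop-time */
    printf("elapsed: %f s\n", elapsed_seconds(start, stop));
    return 0;
}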
Slide 49: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling
With the kind help of our internship students:
Choi Hyewon, Jung Changwook and Lee Jongmyeong
Slide 50: Dynamic Profiling
Profiling = Program Instrumentation + Execution
CPUs provide performance counters to measure various events:
cache hits
cache misses
number of instructions executed
pipeline stalls
number of loads from memory
number of stores to memory
And then there is the Heisenberg effect: measuring can perturb a program’s execution time!
Tools: GNU gprof, Intel VTune (http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239144.htm), OProfile, Valgrind, and many more!
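As a concrete example of one of these tools, a typical GNU gprof session looks like this (a sketch; foo.c is a placeholder name):

gcc -pg -g -o foo foo.c    # -pg inserts gprof instrumentation
./foo                      # the run writes gmon.out into the current directory
gprof foo gmon.out         # print flat profile and call graph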
Slide 51: Dynamic Profiling
Profiling through instrumentation
manual or automatic adding of counters
• requires changes to your program -> Heisenberg!
measure exactly what you want, and where you want
Event-based profiling
CPU provides hw events that it can count with performance counters
• for examples see previous page
works without modifying code
but you should compile with -g (for source-level information)
Sample-based profiling
interrupt execution every x seconds and sample
works without modifying code (-g, same as above)
Slide 52: Common Profiling Workflow
gcc -g: the -g flag tells gcc to generate debug information for source-level analysis & display of results.
[Diagram: application source code -> compile (gcc -g) -> binary -> application runs (profiling executions) -> performance profile -> binary analysis & profile interpretation -> source correlation & UI.]
Slide 53: Intel VTune Amplifier XE 2011
A dynamic profiler, provided by Intel.
• It works on both Intel and AMD hardware, but advanced analyses are available only for Intel processors, esp. Core(TM) 2 processors and later.
• Environment: 32/64-bit, x86-based architectures.
• Platforms: MS Windows, Linux (what we're discussing).
Nov. 30th, 17:00: we'll have a seminar from Intel on VTune and related tools! Please mark it in your calendar now!
Slide 54: Intel VTune Amplifier XE 2011 (cont.)
• Algorithm Analysis
– Hotspots: figuring out the time-consuming parts of the code.
– Locks and Waits: analyzing the parts where threads wait for synchronization and I/O operations. We focus on waiting caused by synchronization, i.e., lock and unlock operations on mutexes and semaphores.
• Advanced Analysis: analyzing hardware issues, e.g., false and true sharing, memory access patterns, etc.
Slide 55: Example Source Code
int array[length];
sum = 0;
get_randarray();   /* fill array with random data */
for (i = 0; i < length; i++) {
    sum = sum + array[i];
}

void get_randarray() {
    size_t i;
    srand((unsigned)time(NULL) + (unsigned)getpid());
    for (i = 0; i < NrElements; ++i)
        data[i] = ((rand() % 128000000) + 1);
}
Slide 56: Hot-Spot Analysis
Elapsed wallclock time: 1.932s.
Computation time across all threads: 2.54s (1.596 + 0.92 + 0.016 + 0.008).
VTune identified get_randarray as the function consuming the most execution time (1.596s).
This is where you would start optimizing your program.
Functions are listed from longest-running to shortest.
The 4 threads consumed altogether 0.92s.
Total thread count: main thread plus 4 Pthreads.
Slide 57: Hotspots by CPU usage
Most of the time (1.5s), only one CPU core was running.
For a small fraction of the execution time, 2 or 4 cores were executing in parallel.
Ideally, 80-100% of the system's cores should be running (the green range).
Slide 58: Example Source Code
int array[length];
sum = 0;
get_randarray();   /* fill array with random data */
pthread_mutex_t m;
for (i = start; i < start + length_per_thread; i++) {
    pthread_mutex_lock(&m);
    sum = sum + array[i];
    pthread_mutex_unlock(&m);
}
Let’s check the locking overhead...
Slide 59: Locks and Waits
Wait time: the summed time all threads spent waiting for the mutex.
Spin time: the time threads spend actively burning CPU cycles in a synchronization construct.
Wait count: the overall number of times the system wait API was called for the analyzed application.
(See the Intel documentation for details.)
Slide 60: Locks and Waits (cont.)
Very little parallelism, due to mutex serialization
Slide 61: Timeline
The threads are busy synchronizing, almost no work done...
Slide 62: Intel VTune Sample Callgraph
Slide 63: Outline
The fundamental laws of parallelism
Amdahl’s Law
Gustafson’s Law
Scalability
Load Balancing
Sources of Performance Loss
False sharing case study
The 4 sources of performance loss
Performance Measurement & Profiling
With the kind help of our internship students:
Choi Hyewon, Jung Changwook and Lee Jongmyeong