Threaded Programming Methodology Intel Software College.
-
date post
22-Dec-2015 -
Category
Documents
-
view
224 -
download
3
Transcript of Threaded Programming Methodology Intel Software College.
Threaded Programming Methodology
Intel Software College
2
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objectives
After completion of this module you will
• Be able to rapidly prototype and estimate the effort required to thread time consuming regions
3
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Agenda
A Generic Development Cycle
Case Study: Prime Number Generation
Common Performance Issues
4
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
What is Parallelism?
Two or more processes or threads execute at the same time
Parallelism for threading architectures
• Multiple processes• Communication through Inter-Process Communication (IPC)
• Single process, multiple threads• Communication through shared memory
5
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
n = number of processors
Tparallel = {(1-P) + P/n} Tserial
Speedup = Tserial / Tparallel
Amdahl’s Law
Describes the upper bound of parallel execution speedup
Serial code limits speedup
(1-P
)P
T ser
ial
(1-P
)
P/2
0.5 + 0.5 + 0.250.25
1.0/0.75 = 1.0/0.75 = 1.331.33
n = 2n = 2n = n = ∞∞
P/∞∞
…
0.5 + 0.5 + 0.00.0
1.0/0.5 = 1.0/0.5 = 2.02.0
6
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Processes and Threads
Modern operating systems load programs as processes
• Resource holder• Execution
A process starts executing at its entry point as a thread
Threads can create other threads within the process• Each thread gets its own stack
All threads within a process share code & data segments
Code segment
Data segment
thread
main()
…thread thread
Stack Stack
Stack
7
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Threads – Benefits & Risks
Benefits
• Increased performance and better resource utilization• Even on single processor systems - for hiding latency and
increasing throughput
• IPC through shared memory is more efficient
Risks
• Increases complexity of the application
• Difficult to debug (data races, deadlocks, etc.)
8
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Commonly Encountered Questions with Threading Applications
Where to thread?
How long would it take to thread?
How much re-design/effort is required?
Is it worth threading a selected region?
What should the expected speedup be?
Will the performance meet expectations?
Will it scale as more threads/data are added?
Which threading model to use?
9
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Prime Number Generation
bool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor) ) factor ++;
return (factor > limit);}
void FindPrimes(int start, int end){ int range = end - start + 1; for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}
i factor
61 3 5 7 63 365 3 567 3 5 7 69 3 71 3 5 7 73 3 5 7 9 75 3 577 3 5 7 79 3 5 7 9
10
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 1
Run Serial version of Prime code
• Locate PrimeSingle directory
• Compile with Intel compiler in Visual Studio
• Run a few times with different ranges
11
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Development Methodology
Analysis
• Find computationally intense code
Design (Introduce Threads)
• Determine how to implement threading solution
Debug for correctness
• Detect any problems resulting from using threads
Tune for performance
• Achieve best parallel performance
12
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Development Cycle
AnalysisAnalysis–VTune™ Performance AnalyzerVTune™ Performance Analyzer
Design (Introduce Threads)Design (Introduce Threads)–Intel® Performance libraries: IPP and MKLIntel® Performance libraries: IPP and MKL
–OpenMP* (Intel® Compiler)OpenMP* (Intel® Compiler)
–Explicit threading (Win32*, Pthreads*)Explicit threading (Win32*, Pthreads*)
Debug for correctnessDebug for correctness–Intel® Thread CheckerIntel® Thread Checker
–Intel DebuggerIntel Debugger
Tune for performanceTune for performance–Intel® Thread ProfilerIntel® Thread Profiler
–VTune™ Performance AnalyzerVTune™ Performance Analyzer
13
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Let’s use the project PrimeSingle for analysis • PrimeSingle <start> <end>
Usage: ./PrimeSingle 1 1000000
Analysis - Sampling
Use VTune Sampling to find hotspots in application
bool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor)) factor ++;
return (factor > limit);}
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1; for( int i = start; i <= end; i+= 2 ){ if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}Identifies the time consuming regions
14
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Analysis - Call Graph
This is the level in the call tree where we need to thread
Used to find proper level in Used to find proper level in the call-tree to threadthe call-tree to thread
15
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Analysis
Where to thread?
• FindPrimes()
Is it worth threading a selected region?
• Appears to have minimal dependencies
• Appears to be data-parallel
• Consumes over 95% of the run timeBaseline measurement
16
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 2
Run code with ‘1 5000000’ range to get baseline measurement
• Make note for future reference
Run VTune analysis on serial code
• What function takes the most time?
17
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Foster’s Design Methodology
From Designing and Building Parallel Programs by Ian Foster
Four Steps:
• PartitioningPartitioning• Dividing computation and data
• CommunicationCommunication• Sharing data between computations
• AgglomerationAgglomeration• Grouping tasks to improve performance
• MappingMapping• Assigning tasks to processors/threads
18
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Designing Threaded Programs
PartitionPartition
• Divide problem into tasks
CommunicateCommunicate
• Determine amount and pattern of communication
AgglomerateAgglomerate
• Combine tasks
MapMap
• Assign agglomerated tasks to created threads
TheProblem
Initial tasks
Communication
Combined Tasks
Final Program
19
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Parallel Programming Models
Functional Decomposition
• Task parallelism
• Divide the computation, then associate the data
• Independent tasks of the same problem
Data Decomposition
• Same operation performed on different data
• Divide data into pieces, then associate computation
20
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Decomposition Methods
Functional Decomposition
• Focusing on computations can reveal structure in a problem
Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC
Domain Decomposition
• Focus on largest or most frequently accessed data structure
• Data Parallelism• Same operation applied to all data
Atmosphere Model
Ocean Model
Land Surface Model
Hydrology Model
21
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Pipelined Decomposition
Computation done in independent stages
Functional decomposition
• Threads are assigned stage to compute
• Automobile assembly line
Data decomposition
• Thread processes all stages of single instance
• One worker builds an entire car
22
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
LAME Encoder Example
LAME MP3 encoder
• Open source project
• Educational tool used for learning
The goal of project is
• To improve the psychoacoustics quality
• To improve the speed of MP3 encoding
23
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
LAME Pipeline Strategy
Frame NFrame N + 1
Time
Other N
Prelude N
Acoustics N
Encoding N
T 2
T 1
Acoustics N+1
Prelude N+1
Other N+1
Encoding N+1
Acoustics N+2
Prelude N+2
T 3
T 4
Prelude N+3
Hierarchical Barrier
OtherPrelude Acoustics Encoding
FrameFetch next frameFrame characterizationSet encode parameters
Psycho AnalysisFFT long/shortFilter assemblage
Apply filteringNoise ShapingQuantize & Count bits
Add frame headerCheck correctnessWrite to disk
24
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Design
What is the expected benefit?
How do you achieve this with the least effort?
How long would it take to thread?
How much re-design/effort is required?
Rapid prototyping with OpenMPRapid prototyping with OpenMP
Speedup(2P) = 100/(96/2+4) = ~1.92X
25
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
OpenMP
Fork-join parallelism:
• Master thread spawns a team of threads as needed
• Parallelism is added incrementally• Sequential program evolves into a parallel program
Parallel Regions
Master Thread
26
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Design
#pragma omp parallel for
for( int i = start; i <= end; i+= 2 ){
if( TestForPrime(i) )
globalPrimes[gPrimesFound++] = i;
ShowProgress(i, range);
}
OpenMP
Create threads here for this parallel region
Divide iterations of the forfor loop
27
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 3
Run OpenMP version of code
• Locate PrimeOpenMP directory and solution
• Compile code
• Run with ‘1 5000000’ for comparison• What is the speedup?
28
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Design
What is the expected benefit?
How do you achieve this with the least effort?
How long would it take to thread?
How much re-design/effort is required?
Is this the best speedup possible?
Speedup of 1.40X (less than 1.92X)Speedup of 1.40X (less than 1.92X)
29
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Debugging for Correctness
Is this threaded implementation right?
No! The answers are different each time …
30
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks
IntelIntel®® Thread Checker Thread Checker
VTune™ Performance AnalyzerVTune™ Performance Analyzer
+DLLs (Instrumented)
Binary Instrumentation
Primes.exe Primes.exe (Instrumented)
Runtime Data
Collector
threadchecker.thr(result file)
31
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Checker
32
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 4
Use Thread Checker to analyze threaded application
• Create Thread Checker activity
• Run application
• Are any errors reported?
33
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Debugging for Correctness
How much re-design/effort is required?
How long would it take to thread?
Thread Checker reported only 2 dependencies, so effort required
should be low
34
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Debugging for Correctness
#pragma omp parallel for
for( int i = start; i <= end; i+= 2 ){
if( TestForPrime(i) )
#pragma omp critical
globalPrimes[gPrimesFound++] = i;
ShowProgress(i, range);
}
#pragma omp critical
{
gProgress++;
percentDone = (int)(gProgress/range *200.0f+0.5f)
}
Will create a critical section for this reference
Will create a critical section for both these references
35
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 5
Modify and run OpenMP version of code
• Add critical region pragmas to code
• Compile code
• Run from within Thread Checker• If errors still present, make appropriate fixes to code and run
again in Thread Checker
• Run with ‘1 5000000’ for comparison• Compile and run outside Thread Checker• What is the speedup?
36
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Correctness
Correct answer, but performance has slipped to ~1.331.33X
Is this the best we can expect from this algorithm?
No! From Amdahl’s Law, we expect speedup close to 1.9X
37
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Common Performance Issues
Parallel Overhead
• Due to thread creation, scheduling …
Synchronization
• Excessive use of global data, contention for the same synchronization object
Load Imbalance
• Improper distribution of parallel work
Granularity
• No sufficient parallel work
38
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications
Thread ProfilerThread Profiler
VTune™ Performance AnalyzerVTune™ Performance Analyzer
+DLL’s (Instrumented)
Binary Instrumentation
Primes.c
Primes.exe (Instrumented)
Runtime Data
Collector
Bistro.tp/guide.gvs(result file)
CompilerSource
Instrumentation
Primes.exe
/Qopenmp_profile
39
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler for OpenMP
40
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler for OpenMP
Speedup Graph
Estimates threading speedup and potential speedup
– Based on Amdahl’s Law computation
Gives upper and lower bound estimates
41
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler for OpenMP
serial
serial
parallel
42
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler for OpenMP
Thread 0
Thread 1
Thread 2
Thread 3
43
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler (for Explicit Threads)
44
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler (for Explicit Threads)
Why so many transitions?
45
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance
This implementation has implicit synchronization calls
This limits scaling performance due to the resulting context switches
Back to the design stage
46
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 6
Use Thread Profiler to analyze threaded application
• Use /Qopenmp_profile to compile and link
• Create Thread Profiler Activity (for explicit threads)
• Run application in Thread Profiler
• Find the source line that is causing the threads to be inactive
47
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance
Is that much contention expected?
The algorithm has many more updates than the 10 needed for showing progress
void ShowProgress( int val, int range ){ int percentDone; gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
if( percentDone % 10 == 0 ) printf("\b\b\b\b%3d%%", percentDone);}
void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
This change should fix the contention issue
48
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Design
Goals
• Eliminate the contention due to implicit synchronization
Speedup is 2.32X2.32X ! Is that right?Speedup is 2.32X2.32X ! Is that right?
49
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance
Our original baseline measurement had the “flawed” progress update algorithm
Is this the best we can expect from this algorithm?
Speedup is actually 1.40X 1.40X (<<1.9X)! Speedup is actually 1.40X 1.40X (<<1.9X)!
50
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 7
Modify ShowProgress function (both serial and OpenMP) to print only the needed output
• Recompile and run the code• Be sure no instrumentation flags are used
• What is speedup from serial version now?
if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }
51
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance Re-visited
Still have 62% of execution time in
locks and synchronization
52
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance Re-visited
Let’s look at the OpenMP locks…
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) )#pragma omp critical globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}
Lock is in a loop
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}
53
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Performance Re-visited
Let’s look at the second lock
void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;
#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
This lock is also being called within a loop
void ShowProgress( int val, int range ){ long percentDone, localProgress; static int lastPercentDone = 0;
localProgress = InterlockedIncrement(&gProgress); percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f); if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
54
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 8
Modify OpenMP critical regions to use InterlockedIncrement instead
• Re-compile and run code
• What is speedup from serial version now?
55
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Thread Profiler for OpenMP
Thread 0
342 factors to test 116747
Thread 1
612 factors to test 373553
Thread 2
789 factors to test 623759
Thread 3
934 factors to test 873913
500000
250000
750000
1000000
56
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Fixing the Load Imbalance
Distribute the work more evenly
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for schedule(static, 8) for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}
Speedup achieved is 1.68X1.68X
57
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Activity 9
Modify code for better load balance
• Add schedule (static, 8) clause to OpenMP parallel for pragma
• Re-compile and run code
• What is speedup from serial version now?
58
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Final Thread Profiler Run
Speedup achieved is 1.80X1.80XSpeedup achieved is 1.80X1.80X
59
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Comparative Analysis
Threading applications require multiple iterations of going through
the software development cycle
60
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Threading MethodologyWhat’s Been Covered
Four step development cycle for writing threaded code from serial and the Intel® tools that support each step
• Analysis
• Design (Introduce Threads)
• Debug for correctness
• Tune for performance
Threading applications require multiple iterations of designing, debugging and performance tuning steps
Use tools to improve productivity
62
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Backup Slides
63
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Parallel Overhead
Thread Creation overhead
• Overhead increases rapidly as the number of active threads increases
Solution
• Use of re-usable threads and thread pools• Amortizes the cost of thread creation • Keeps number of active threads relatively constant
64
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Synchronization
Heap contention• Allocation from heap causes implicit synchronization
• Allocate on stack or use thread local storage
Atomic updates versus critical sections• Some global data updates can use atomic operations (Interlocked family)
• Use atomic updates whenever possible
Critical Sections versus mutual exclusion• Critical Section objects reside in user space
• Use CRITICAL SECTION objects when visibility across process boundaries is not required
• Introduces lesser overhead
• Has a spin-wait variant that is useful for some applications
65
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Load Imbalance
Unequal work loads lead to idle threads and wasted time
Tim
e
Busy Idle
66
Copyright © 2006, Intel Corporation. All rights reserved.
Threaded Programming Methodology
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Granularity
Coarse grain
Fine grain
Parallelizable portionSerial
Parallelizable portion
Serial
Scaling: ~3X
Scaling: ~1.10X
Scaling: ~2.5X
Scaling: ~1.05X