Threaded Programming Methodology Intel Software College.

66
Threaded Programming Methodology Intel Software College
  • date post

    22-Dec-2015
  • Category

    Documents

  • view

    224
  • download

    3

Transcript of Threaded Programming Methodology Intel Software College.

Page 1: Threaded Programming Methodology Intel Software College.

Threaded Programming Methodology

Intel Software College

Page 2: Threaded Programming Methodology Intel Software College.

2

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Objectives

After completion of this module you will

• Be able to rapidly prototype and estimate the effort required to thread time consuming regions

Page 3: Threaded Programming Methodology Intel Software College.

3

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda

A Generic Development Cycle

Case Study: Prime Number Generation

Common Performance Issues

Page 4: Threaded Programming Methodology Intel Software College.

4

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

What is Parallelism?

Two or more processes or threads execute at the same time

Parallelism for threading architectures

• Multiple processes• Communication through Inter-Process Communication (IPC)

• Single process, multiple threads• Communication through shared memory

Page 5: Threaded Programming Methodology Intel Software College.

5

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

n = number of processors

Tparallel = {(1-P) + P/n} Tserial

Speedup = Tserial / Tparallel

Amdahl’s Law

Describes the upper bound of parallel execution speedup

Serial code limits speedup

(1-P

)P

T ser

ial

(1-P

)

P/2

0.5 + 0.5 + 0.250.25

1.0/0.75 = 1.0/0.75 = 1.331.33

n = 2n = 2n = n = ∞∞

P/∞∞

0.5 + 0.5 + 0.00.0

1.0/0.5 = 1.0/0.5 = 2.02.0

Page 6: Threaded Programming Methodology Intel Software College.

6

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Processes and Threads

Modern operating systems load programs as processes

• Resource holder• Execution

A process starts executing at its entry point as a thread

Threads can create other threads within the process• Each thread gets its own stack

All threads within a process share code & data segments

Code segment

Data segment

thread

main()

…thread thread

Stack Stack

Stack

Page 7: Threaded Programming Methodology Intel Software College.

7

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Threads – Benefits & Risks

Benefits

• Increased performance and better resource utilization• Even on single processor systems - for hiding latency and

increasing throughput

• IPC through shared memory is more efficient

Risks

• Increases complexity of the application

• Difficult to debug (data races, deadlocks, etc.)

Page 8: Threaded Programming Methodology Intel Software College.

8

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Commonly Encountered Questions with Threading Applications

Where to thread?

How long would it take to thread?

How much re-design/effort is required?

Is it worth threading a selected region?

What should the expected speedup be?

Will the performance meet expectations?

Will it scale as more threads/data are added?

Which threading model to use?

Page 9: Threaded Programming Methodology Intel Software College.

9

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Prime Number Generation

bool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor) ) factor ++;

return (factor > limit);}

void FindPrimes(int start, int end){ int range = end - start + 1; for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}

i factor

61 3 5 7 63 365 3 567 3 5 7 69 3 71 3 5 7 73 3 5 7 9 75 3 577 3 5 7 79 3 5 7 9

Page 10: Threaded Programming Methodology Intel Software College.

10

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 1

Run Serial version of Prime code

• Locate PrimeSingle directory

• Compile with Intel compiler in Visual Studio

• Run a few times with different ranges

Page 11: Threaded Programming Methodology Intel Software College.

11

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Development Methodology

Analysis

• Find computationally intense code

Design (Introduce Threads)

• Determine how to implement threading solution

Debug for correctness

• Detect any problems resulting from using threads

Tune for performance

• Achieve best parallel performance

Page 12: Threaded Programming Methodology Intel Software College.

12

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Development Cycle

AnalysisAnalysis–VTune™ Performance AnalyzerVTune™ Performance Analyzer

Design (Introduce Threads)Design (Introduce Threads)–Intel® Performance libraries: IPP and MKLIntel® Performance libraries: IPP and MKL

–OpenMP* (Intel® Compiler)OpenMP* (Intel® Compiler)

–Explicit threading (Win32*, Pthreads*)Explicit threading (Win32*, Pthreads*)

Debug for correctnessDebug for correctness–Intel® Thread CheckerIntel® Thread Checker

–Intel DebuggerIntel Debugger

Tune for performanceTune for performance–Intel® Thread ProfilerIntel® Thread Profiler

–VTune™ Performance AnalyzerVTune™ Performance Analyzer

Page 13: Threaded Programming Methodology Intel Software College.

13

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Let’s use the project PrimeSingle for analysis • PrimeSingle <start> <end>

Usage: ./PrimeSingle 1 1000000

Analysis - Sampling

Use VTune Sampling to find hotspots in application

bool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor)) factor ++;

return (factor > limit);}

void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1; for( int i = start; i <= end; i+= 2 ){ if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}Identifies the time consuming regions

Page 14: Threaded Programming Methodology Intel Software College.

14

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Analysis - Call Graph

This is the level in the call tree where we need to thread

Used to find proper level in Used to find proper level in the call-tree to threadthe call-tree to thread

Page 15: Threaded Programming Methodology Intel Software College.

15

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Analysis

Where to thread?

• FindPrimes()

Is it worth threading a selected region?

• Appears to have minimal dependencies

• Appears to be data-parallel

• Consumes over 95% of the run timeBaseline measurement

Page 16: Threaded Programming Methodology Intel Software College.

16

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 2

Run code with ‘1 5000000’ range to get baseline measurement

• Make note for future reference

Run VTune analysis on serial code

• What function takes the most time?

Page 17: Threaded Programming Methodology Intel Software College.

17

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Foster’s Design Methodology

From Designing and Building Parallel Programs by Ian Foster

Four Steps:

• PartitioningPartitioning• Dividing computation and data

• CommunicationCommunication• Sharing data between computations

• AgglomerationAgglomeration• Grouping tasks to improve performance

• MappingMapping• Assigning tasks to processors/threads

Page 18: Threaded Programming Methodology Intel Software College.

18

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Designing Threaded Programs

PartitionPartition

• Divide problem into tasks

CommunicateCommunicate

• Determine amount and pattern of communication

AgglomerateAgglomerate

• Combine tasks

MapMap

• Assign agglomerated tasks to created threads

TheProblem

Initial tasks

Communication

Combined Tasks

Final Program

Page 19: Threaded Programming Methodology Intel Software College.

19

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Parallel Programming Models

Functional Decomposition

• Task parallelism

• Divide the computation, then associate the data

• Independent tasks of the same problem

Data Decomposition

• Same operation performed on different data

• Divide data into pieces, then associate computation

Page 20: Threaded Programming Methodology Intel Software College.

20

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Decomposition Methods

Functional Decomposition

• Focusing on computations can reveal structure in a problem

Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC

Domain Decomposition

• Focus on largest or most frequently accessed data structure

• Data Parallelism• Same operation applied to all data

Atmosphere Model

Ocean Model

Land Surface Model

Hydrology Model

Page 21: Threaded Programming Methodology Intel Software College.

21

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Pipelined Decomposition

Computation done in independent stages

Functional decomposition

• Threads are assigned stage to compute

• Automobile assembly line

Data decomposition

• Thread processes all stages of single instance

• One worker builds an entire car

Page 22: Threaded Programming Methodology Intel Software College.

22

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

LAME Encoder Example

LAME MP3 encoder

• Open source project

• Educational tool used for learning

The goal of project is

• To improve the psychoacoustics quality

• To improve the speed of MP3 encoding

Page 23: Threaded Programming Methodology Intel Software College.

23

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

LAME Pipeline Strategy

Frame NFrame N + 1

Time

Other N

Prelude N

Acoustics N

Encoding N

T 2

T 1

Acoustics N+1

Prelude N+1

Other N+1

Encoding N+1

Acoustics N+2

Prelude N+2

T 3

T 4

Prelude N+3

Hierarchical Barrier

OtherPrelude Acoustics Encoding

FrameFetch next frameFrame characterizationSet encode parameters

Psycho AnalysisFFT long/shortFilter assemblage

Apply filteringNoise ShapingQuantize & Count bits

Add frame headerCheck correctnessWrite to disk

Page 24: Threaded Programming Methodology Intel Software College.

24

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Design

What is the expected benefit?

How do you achieve this with the least effort?

How long would it take to thread?

How much re-design/effort is required?

Rapid prototyping with OpenMPRapid prototyping with OpenMP

Speedup(2P) = 100/(96/2+4) = ~1.92X

Page 25: Threaded Programming Methodology Intel Software College.

25

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

OpenMP

Fork-join parallelism:

• Master thread spawns a team of threads as needed

• Parallelism is added incrementally• Sequential program evolves into a parallel program

Parallel Regions

Master Thread

Page 26: Threaded Programming Methodology Intel Software College.

26

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Design

#pragma omp parallel for

for( int i = start; i <= end; i+= 2 ){

if( TestForPrime(i) )

globalPrimes[gPrimesFound++] = i;

ShowProgress(i, range);

}

OpenMP

Create threads here for this parallel region

Divide iterations of the forfor loop

Page 27: Threaded Programming Methodology Intel Software College.

27

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 3

Run OpenMP version of code

• Locate PrimeOpenMP directory and solution

• Compile code

• Run with ‘1 5000000’ for comparison• What is the speedup?

Page 28: Threaded Programming Methodology Intel Software College.

28

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Design

What is the expected benefit?

How do you achieve this with the least effort?

How long would it take to thread?

How much re-design/effort is required?

Is this the best speedup possible?

Speedup of 1.40X (less than 1.92X)Speedup of 1.40X (less than 1.92X)

Page 29: Threaded Programming Methodology Intel Software College.

29

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Debugging for Correctness

Is this threaded implementation right?

No! The answers are different each time …

Page 30: Threaded Programming Methodology Intel Software College.

30

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Debugging for Correctness

Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks

IntelIntel®® Thread Checker Thread Checker

VTune™ Performance AnalyzerVTune™ Performance Analyzer

+DLLs (Instrumented)

Binary Instrumentation

Primes.exe Primes.exe (Instrumented)

Runtime Data

Collector

threadchecker.thr(result file)

Page 31: Threaded Programming Methodology Intel Software College.

31

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Checker

Page 32: Threaded Programming Methodology Intel Software College.

32

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 4

Use Thread Checker to analyze threaded application

• Create Thread Checker activity

• Run application

• Are any errors reported?

Page 33: Threaded Programming Methodology Intel Software College.

33

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Debugging for Correctness

How much re-design/effort is required?

How long would it take to thread?

Thread Checker reported only 2 dependencies, so effort required

should be low

Page 34: Threaded Programming Methodology Intel Software College.

34

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Debugging for Correctness

#pragma omp parallel for

for( int i = start; i <= end; i+= 2 ){

if( TestForPrime(i) )

#pragma omp critical

globalPrimes[gPrimesFound++] = i;

ShowProgress(i, range);

}

#pragma omp critical

{

gProgress++;

percentDone = (int)(gProgress/range *200.0f+0.5f)

}

Will create a critical section for this reference

Will create a critical section for both these references

Page 35: Threaded Programming Methodology Intel Software College.

35

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 5

Modify and run OpenMP version of code

• Add critical region pragmas to code

• Compile code

• Run from within Thread Checker• If errors still present, make appropriate fixes to code and run

again in Thread Checker

• Run with ‘1 5000000’ for comparison• Compile and run outside Thread Checker• What is the speedup?

Page 36: Threaded Programming Methodology Intel Software College.

36

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Correctness

Correct answer, but performance has slipped to ~1.331.33X

Is this the best we can expect from this algorithm?

No! From Amdahl’s Law, we expect speedup close to 1.9X

Page 37: Threaded Programming Methodology Intel Software College.

37

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Common Performance Issues

Parallel Overhead

• Due to thread creation, scheduling …

Synchronization

• Excessive use of global data, contention for the same synchronization object

Load Imbalance

• Improper distribution of parallel work

Granularity

• No sufficient parallel work

Page 38: Threaded Programming Methodology Intel Software College.

38

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Tuning for Performance

Thread Profiler pinpoints performance bottlenecks in threaded applications

Thread ProfilerThread Profiler

VTune™ Performance AnalyzerVTune™ Performance Analyzer

+DLL’s (Instrumented)

Binary Instrumentation

Primes.c

Primes.exe (Instrumented)

Runtime Data

Collector

Bistro.tp/guide.gvs(result file)

CompilerSource

Instrumentation

Primes.exe

/Qopenmp_profile

Page 39: Threaded Programming Methodology Intel Software College.

39

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler for OpenMP

Page 40: Threaded Programming Methodology Intel Software College.

40

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler for OpenMP

Speedup Graph

Estimates threading speedup and potential speedup

– Based on Amdahl’s Law computation

Gives upper and lower bound estimates

Page 41: Threaded Programming Methodology Intel Software College.

41

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler for OpenMP

serial

serial

parallel

Page 42: Threaded Programming Methodology Intel Software College.

42

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler for OpenMP

Thread 0

Thread 1

Thread 2

Thread 3

Page 43: Threaded Programming Methodology Intel Software College.

43

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler (for Explicit Threads)

Page 44: Threaded Programming Methodology Intel Software College.

44

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler (for Explicit Threads)

Why so many transitions?

Page 45: Threaded Programming Methodology Intel Software College.

45

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance

This implementation has implicit synchronization calls

This limits scaling performance due to the resulting context switches

Back to the design stage

Page 46: Threaded Programming Methodology Intel Software College.

46

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 6

Use Thread Profiler to analyze threaded application

• Use /Qopenmp_profile to compile and link

• Create Thread Profiler Activity (for explicit threads)

• Run application in Thread Profiler

• Find the source line that is causing the threads to be inactive

Page 47: Threaded Programming Methodology Intel Software College.

47

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance

Is that much contention expected?

The algorithm has many more updates than the 10 needed for showing progress

void ShowProgress( int val, int range ){ int percentDone; gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);

if( percentDone % 10 == 0 ) printf("\b\b\b\b%3d%%", percentDone);}

void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}

This change should fix the contention issue

Page 48: Threaded Programming Methodology Intel Software College.

48

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Design

Goals

• Eliminate the contention due to implicit synchronization

Speedup is 2.32X2.32X ! Is that right?Speedup is 2.32X2.32X ! Is that right?

Page 49: Threaded Programming Methodology Intel Software College.

49

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance

Our original baseline measurement had the “flawed” progress update algorithm

Is this the best we can expect from this algorithm?

Speedup is actually 1.40X 1.40X (<<1.9X)! Speedup is actually 1.40X 1.40X (<<1.9X)!

Page 50: Threaded Programming Methodology Intel Software College.

50

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 7

Modify ShowProgress function (both serial and OpenMP) to print only the needed output

• Recompile and run the code• Be sure no instrumentation flags are used

• What is speedup from serial version now?

if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }

Page 51: Threaded Programming Methodology Intel Software College.

51

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance Re-visited

Still have 62% of execution time in

locks and synchronization

Page 52: Threaded Programming Methodology Intel Software College.

52

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance Re-visited

Let’s look at the OpenMP locks…

void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;

#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) )#pragma omp critical globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}

Lock is in a loop

void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;

#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}

Page 53: Threaded Programming Methodology Intel Software College.

53

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Performance Re-visited

Let’s look at the second lock

void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;

#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}

This lock is also being called within a loop

void ShowProgress( int val, int range ){ long percentDone, localProgress; static int lastPercentDone = 0;

localProgress = InterlockedIncrement(&gProgress); percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f); if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}

Page 54: Threaded Programming Methodology Intel Software College.

54

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 8

Modify OpenMP critical regions to use InterlockedIncrement instead

• Re-compile and run code

• What is speedup from serial version now?

Page 55: Threaded Programming Methodology Intel Software College.

55

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Thread Profiler for OpenMP

Thread 0

342 factors to test 116747

Thread 1

612 factors to test 373553

Thread 2

789 factors to test 623759

Thread 3

934 factors to test 873913

500000

250000

750000

1000000

Page 56: Threaded Programming Methodology Intel Software College.

56

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Fixing the Load Imbalance

Distribute the work more evenly

void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;

#pragma omp parallel for schedule(static, 8) for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}

Speedup achieved is 1.68X1.68X

Page 57: Threaded Programming Methodology Intel Software College.

57

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Activity 9

Modify code for better load balance

• Add schedule (static, 8) clause to OpenMP parallel for pragma

• Re-compile and run code

• What is speedup from serial version now?

Page 58: Threaded Programming Methodology Intel Software College.

58

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Final Thread Profiler Run

Speedup achieved is 1.80X1.80XSpeedup achieved is 1.80X1.80X

Page 59: Threaded Programming Methodology Intel Software College.

59

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Comparative Analysis

Threading applications require multiple iterations of going through

the software development cycle

Page 60: Threaded Programming Methodology Intel Software College.

60

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Threading MethodologyWhat’s Been Covered

Four step development cycle for writing threaded code from serial and the Intel® tools that support each step

• Analysis

• Design (Introduce Threads)

• Debug for correctness

• Tune for performance

Threading applications require multiple iterations of designing, debugging and performance tuning steps

Use tools to improve productivity

Page 61: Threaded Programming Methodology Intel Software College.
Page 62: Threaded Programming Methodology Intel Software College.

62

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Backup Slides

Page 63: Threaded Programming Methodology Intel Software College.

63

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Parallel Overhead

Thread Creation overhead

• Overhead increases rapidly as the number of active threads increases

Solution

• Use of re-usable threads and thread pools• Amortizes the cost of thread creation • Keeps number of active threads relatively constant

Page 64: Threaded Programming Methodology Intel Software College.

64

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Synchronization

Heap contention• Allocation from heap causes implicit synchronization

• Allocate on stack or use thread local storage

Atomic updates versus critical sections• Some global data updates can use atomic operations (Interlocked family)

• Use atomic updates whenever possible

Critical Sections versus mutual exclusion• Critical Section objects reside in user space

• Use CRITICAL SECTION objects when visibility across process boundaries is not required

• Introduces lesser overhead

• Has a spin-wait variant that is useful for some applications

Page 65: Threaded Programming Methodology Intel Software College.

65

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Load Imbalance

Unequal work loads lead to idle threads and wasted time

Tim

e

Busy Idle

Page 66: Threaded Programming Methodology Intel Software College.

66

Copyright © 2006, Intel Corporation. All rights reserved.

Threaded Programming Methodology

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Granularity

Coarse grain

Fine grain

Parallelizable portionSerial

Parallelizable portion

Serial

Scaling: ~3X

Scaling: ~1.10X

Scaling: ~2.5X

Scaling: ~1.05X