Software & Services Group, Developer Products Division. Copyright © 2009, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners. 11/25/2009
Parallel Programming Methods and Tools: Creating Parallel Code
Hubert Haberstock, Technical Consulting Engineer
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2009. Intel Corporation.
http://intel.com/software/products
Intel® Parallel Studio
Intuitive development tools for multicore parallelism

• Microsoft Visual Studio* plug-in
• End-to-end product suite for parallelism
• Forward scaling to manycore

Intel® Parallel Studio includes all three:
• Intel® Parallel Composer
• Intel® Parallel Inspector
• Intel® Parallel Amplifier
Intel® Parallel Composer
Speeds software development incorporating parallelism, with a C/C++ compiler and comprehensive threaded libraries

Simplifies threading for improved developer productivity
– “Think Parallel” and “Code Parallel” without writing the low-level thread management

Best choice because:
• Easier Parallel Implementation
– Included libraries and built-in capabilities reduce the amount of code and increase scalability
• Best Performance on Windows*
– Built-in optimization features and libraries extract the most from multicore processors
• Interoperable
– Supports a variety of threading methods and compilers
– Seamless integration with Microsoft Visual Studio
• Part of Intel® Parallel Studio
– Comprehensive solution to develop, debug, and tune parallel applications
More than a C++ Compiler …
Simplifies threading for improved developer productivity
– Simple concurrent functionality (__task/__taskcomplete)
– Vectorization support for SSE2/SSE3/SSSE3/SSE4
– OpenMP* 3.0
– Intel® Parallel Debugger Extension, a plug-in to Visual Studio*
– Intel® Threading Building Blocks
– C++ lambda function support
– Intel® Integrated Performance Primitives (IPP)
– Parallel build feature: /MP enables use of multiple cores to speed up the build
– Diagnostics to help develop parallel programs (/Qdiag-enable:thread)
– Seamless integration into Microsoft Visual Studio* 2005 and 2008
– Microsoft source and binary compatible
Compatibility with Microsoft
Source and binary compatible
• Full compatibility with the .NET* build environment
• Binary compatibility
• Name mangling and calling convention
• Debug format compatibility
• Mix and match object files or DLLs

Limitations
• No support for .pch files; use .pchi files instead
• No support for attributed programming or managed C++
Intel® Integrated Performance Primitives (IPP) 6.0
Intel® IPP 6.0 Features
• Optimized library for multiple problem domains
– Image processing
– JPEG, JPEG2000
– Image search descriptors (MPEG-7): color layout, edge histogram
– 3D support
– Audio/video
– Codecs: H.264, MPEG-4
– Video enhancement: denoise, deinterlace, demosaic
– Microsoft* RT Audio
– Data compression
– LZO, zlib, gzip, bzip2
• Single- and multi-core processor optimizations and support
– Intel® Atom™ and Intel® Core™ processor family, including Intel® Core™ i7
Intel® IPP Sample for Implicit Parallelism
Executable: ps_ippi.exe, ps_ippm.exe (–Nx switch)
Library: Intel IPP v5.3 (ippiv8-5.3.dll, ippmv8-5.3.dll)
Compiler: Intel® C++ Compiler for IA-32, Version 10.0
CPU: Intel(R) Core(TM) 2 Quad processor 8x2400 MHz, L1=32/32K
OS: Microsoft Windows Server 2003* (Win32)

[Chart: Performance on matrix function ippInvert_ma_32f, size 6x6 — clocks per element (0 to 30) vs. size of matrix array (200 to 5000), with one series each for 1, 2, 3, and 4 threads]
(Auto) Vectorization
Using the SIMD registers: let the compiler parallelize …
Vectorization
• For vectorization, SSE adds eight 128-bit registers, XMM0–XMM7. The 64-bit extensions add another eight, XMM8–XMM15.
• Operations on these registers are an extension to the x86 instruction set.
• The Intel® Compiler can automatically generate these SSEx (Streaming SIMD Extensions) instructions.
Compiler Based Vectorization
/Q[a]x<code1>[,<code2>,...] (Windows), –[a]x<code1> (Linux)
– generate specialized code to run exclusively on processors indicated by <code>

SSE2: Intel® Pentium 4 and compatible Intel processors
SSE3: Intel® Core™ processor family with Streaming SIMD Extensions 3
SSSE3: Intel® Core™2 processor family with SSSE3
SSE4.1: Intel® processors supporting SSE4 Vectorizing Compiler and Media Accelerator instructions
SSE4.2: Can generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel® Core™ i7 processors
SIMD – SSEx support
SSE: 4x floats
SSE2: 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit(!) integer, 2x doubles
SIMD operations
• Example: vector addition, where you add two single-precision, 4-component vectors together.
• Using x86 operations, this requires four floating-point additions with FADD instructions:

    sum_vector[x] = vector1[x] + vector2[x];
    sum_vector[y] = vector1[y] + vector2[y];
    sum_vector[z] = vector1[z] + vector2[z];
    sum_vector[w] = vector1[w] + vector2[w];

• Using SSE registers, a single 128-bit packed-add instruction replaces the four scalar additions:

    movaps xmm0, vector1
    addps  xmm0, vector2
    movaps sum_vector, xmm0
Using SSE3 - Your Task: Convert This…
[Diagram: with scalar code, each add uses only one 32-bit lane of the 128-bit registers (A[0] + B[0] = C[0], then A[1] + B[1] = C[1], ...); the other three 32-bit integer lanes stay unused]

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
…Into This…
[Diagram: packed code fills all four 32-bit lanes of the 128-bit registers, computing A[3..0] + B[3..0] = C[3..0] in one instruction]

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];

• The processor-targeting switch vectorizes loops of floating-point and scalar operations.
• Usage: Linux* –xSSE3, Windows* /QxSSE3
Auto-vectorization
• Enabled with option /Qx or /Qax
• Watch for the “Loop vectorized” message in the build log
• When will it not work?
– Loops need to be independent for the compiler to auto-vectorize
– Use option /Qvec-report:n (n = 1, 2, 3) for a detailed report on loops that do not vectorize
Example: Vectorization Report

Driver.c
Driver.c(64): (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 65.
Driver.c(74): (col. 2) remark: LOOP WAS VECTORIZED.
Driver.c(39): (col. 2) remark: LOOP WAS VECTORIZED.
Driver.c(28): (col. 10) remark: loop was not vectorized: statement cannot be vectorized.
Driver.c(30): (col. 3) remark: LOOP WAS VECTORIZED.
Driver.c(12): (col. 2) remark: loop was not vectorized: not inner loop.
Driver.c(14): (col. 14) remark: loop was not vectorized: statement cannot be vectorized.
Driver.c(18): (col. 3) remark: loop was not vectorized: not inner loop.
Driver.c(19): (col. 4) remark: LOOP WAS VECTORIZED.
Multiply.c
Multiply.c(7): (col. 2) remark: loop was not vectorized: not inner loop.
Multiply.c(9): (col. 3) remark: loop was not vectorized: unsupported loop structure.
-out:MatVector.exe
Streaming SIMD Extensions 4.2
SSE4.2: 7 new instructions for
+ faster video/media
+ faster encryption/compression
+ faster XML parsing
+ faster search & pattern matching
Intel® Core™ i7 (codename Nehalem)
• No new data types, use 128-bit operand similar to SSE4.1
• No new 64-bit SIMD instructions
• Functional in both 32 bit and 64 bit modes
The future of SIMD
• Now: SSEx with 128-bit vectors
• Next: AVX with 256-bit vectors (Sandy Bridge)
• Then: LRBni with 512-bit vectors
• …

[Diagram: vector width growing from SSE (128 bit) to AVX (256 bit) to LRBni (512 bit)]
OpenMP*
The easiest parallel model …
What is OpenMP* ?
Portable, open, shared-memory multi-processing specification (API)
– Fortran 77, Fortran 90, C, and C++
– Multi-vendor support, for all popular operating systems
– E.g. compilers from Intel, Microsoft (since VS2005), and GNU/GCC (since 4.2)

• Three components:
– #pragmas (compiler directives), the most important
– API and runtime library
– Environment variables

• Combines serial and parallel code in a single source
– Lightweight compared to a thread class or thread API
– Supports an incremental approach to parallelization
– Ideally no need to change the serial code
Where to find out more about OpenMP*?
[Word cloud of OpenMP constructs: omp_set_lock, #pragma omp parallel for, #pragma omp critical, C$OMP PARALLEL DO, reduction/private/shared/firstprivate clauses, OMP_SET_NUM_THREADS, OMP_SCHEDULE, sections, single, master, ordered, atomic, flush, barrier, threadprivate, and more]

http://www.openmp.org
Current spec is OpenMP 3.0: 326 pages (combined C/C++ and Fortran)
OpenMP idea
• What can we parallelize?
– Code regions: independent code regions
– Loops: divide the iteration space
– Synchronization (!)
• Thread handling?
– Automatic (number of cores)
– API
Programming Model
Fork-join parallelism:
• Master thread spawns a team of threads
• Parallelism is added incrementally: the sequential program evolves into a parallel program

[Diagram: a master thread forking into thread teams at each parallel region and joining again afterwards]
Parallel Loop Model
• Threads are created
• Data is classed as shared or private (here, A is shared)
• Iterations are distributed across threads
• Barrier at the end of the parallel for; threads either spin or sleep between regions

    void* work(float* A) {
    #pragma omp parallel for shared(A) private(i)
        for (i = 1; i <= 12; i++) {
            /* iterations divided among threads */
        }
    }
Parallel Sections
• Independent sections of code can execute concurrently
Serial Parallel
    #pragma omp parallel sections
    {
        #pragma omp section
        phase1();
        #pragma omp section
        phase2();
        #pragma omp section
        phase3();
    }
Data Scope Clauses
• Private
– A new object of the same type is created for each thread in the team (the object is not initialized)
• Shared
– Shares the variable among all threads
• Default
– Specifies the default data scope (private, shared, or none)
• Firstprivate
– The variable is initialized with the value of the original object (superset of private)
Intel® Compiler and OpenMP*
• OpenMP* support
– /Qopenmp
– /Qopenmp_report{0|1|2}
– /Qopenmp_profile
• Compatibility mode with Microsoft/GCC
– Supports mixed compilation even when OpenMP code is present in both parts
– /Qopenmp-lib:[compat|legacy]
• Additional environment variables
– E.g. KMP_AFFINITY
OpenMP and C# ?
• Encapsulate OpenMP in a C++/CLI DLL (VS2005 or later)
• Anonymous delegates (C# 2.0 or later)
• MSDN webcast (B. Marquardt)

C++:
[…]
public ref class ParExec {
public:
    static void For(int iStart, int iEnd, Method<int>^ loopbody) {
        #pragma omp parallel for
        for (int i = iStart; i < iEnd; i++) {
            loopbody(i);
        }
    }
[…]

C#:
[…]
ParExec.For(0, iAnz, delegate(int i) {
    arr[i] = Math.Sqrt(i) + Math.Pow(i, 0.3);
});
[…]
OpenMP* 3.0 Support
– 1st compiler to be fully compliant with OpenMP* 3.0
– http://www.openmp.org/mp-documents/spec30.pdf
– Express parallelism explicitly
– Use pragmas and library calls to describe parallel loops and tasks
– New extensions:
– Tasking for unstructured parallelism
– Enhanced loop scheduling control
– Better support for nested parallelism
OpenMP* 3.0 Tasking Example Postorder Tree Traversal
    void postorder(node *p) {
        if (p->left)
            #pragma omp task
            postorder(p->left);
        if (p->right)
            #pragma omp task
            postorder(p->right);
        #pragma omp taskwait   // wait for descendants
        process(p->data);
    }
    // Parent task suspended until child tasks complete

Task scheduling point: threads may switch to execute other tasks.
Intel® Threading Building Blocks
A kind of “STL for parallel C++ programming”

• You specify tasks (that can run concurrently) instead of threads
– The library maps user-defined logical tasks onto physical threads, efficiently using cache and balancing load
– Full support for nested parallelism
• Targets threading for scalable performance
– Portable across Linux*, Mac OS*, Windows*, and Solaris*
• Emphasizes scalable data-parallel programming
– Loop parallelism tasks are more scalable than a fixed number of separate tasks
• Compatible with other threading packages
– Can be used in concert with native threads and OpenMP*
Threading Building Blocks - Components
Synchronization primitives: atomic operations, various flavors of mutexes (improved)
Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan
Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved)
Task scheduler: with new functionality
Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator
Utilities: tick_count, tbb_thread (new)

(“new”: introduced in the latest version, v2.1)
Thread Setup and Initialization

    CRITICAL_SECTION MyMutex, MyMutex2, MyMutex3;

    int get_num_cpus (void) {
        SYSTEM_INFO si;
        GetSystemInfo (&si);
        return (int) si.dwNumberOfProcessors;
    }

    int nthreads = get_num_cpus ();
    HANDLE *threads = (HANDLE *) alloca (nthreads * sizeof (HANDLE));
    InitializeCriticalSection (&MyMutex);
    InitializeCriticalSection (&MyMutex2);
    InitializeCriticalSection (&MyMutex3);
    for (int i = 0; i < nthreads; i++) {
        DWORD id;
        threads[i] = CreateThread (NULL, 0, parallel_thread, (LPVOID) i, 0, &id);
    }
    for (int i = 0; i < nthreads; i++) {
        WaitForSingleObject (threads[i], INFINITE);
    }
Parallel Task Scheduling and Execution

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
        patch pch;
        struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    void generate_work (patch *pchin)
    {
        int startx, stopx, starty, stopy;
        int xs, ys;
        startx = pchin->startx;  stopx = pchin->stopx;
        starty = pchin->starty;  stopy = pchin->stopy;
        if (((stopx - startx) >= MINPATCH) || ((stopy - starty) >= MINPATCH)) {
            int xpatchsize = (stopx - startx) / DIVFACTOR + 1;
            int ypatchsize = (stopy - starty) / DIVFACTOR + 1;
            for (ys = starty; ys <= stopy; ys += ypatchsize)
                for (xs = startx; xs <= stopx; xs += xpatchsize) {
                    patch pch;
                    pch.startx = xs;  pch.starty = ys;
                    pch.stopx = MIN (xs + xpatchsize - 1, stopx);
                    pch.stopy = MIN (ys + ypatchsize - 1, stopy);
                    generate_work (&pch);
                }
        } else {  /* just trace this patch */
            work_queue_entry_t *q =
                (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
            q->pch.starty = starty;  q->pch.stopy = stopy;
            q->pch.startx = startx;  q->pch.stopx = stopx;
            q->next = NULL;
            if (work_queue_head == NULL) {
                work_queue_head = q;
            } else {
                work_queue_tail->next = q;
            }
            work_queue_tail = q;
        }
    }

    void generate_worklist (void)
    {
        patch pch;
        pch.startx = startx;  pch.stopx = stopx;
        pch.starty = starty;  pch.stopy = stopy;
        generate_work (&pch);
    }

    bool schedule_thread_work (patch &pch)
    {
        EnterCriticalSection (&MyMutex3);
        work_queue_entry_t *q = work_queue_head;
        if (q != NULL) {
            pch = q->pch;
            work_queue_head = work_queue_head->next;
        }
        LeaveCriticalSection (&MyMutex3);
        return (q != NULL);
    }

    generate_worklist ();

    void parallel_thread (void *arg)
    {
        patch pch;
        while (schedule_thread_work (pch)) {
            for (int y = pch.starty; y <= pch.stopy; y++)
                for (int x = pch.startx; x <= pch.stopx; x++)
                    render_one_pixel (x, y);
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                EnterCriticalSection (&MyMutex3);
                for (int y = pch.starty; y <= pch.stopy; y++) {
                    GraphicsDrawRow (pch.startx - 1, y - 1, pch.stopx - pch.startx + 1,
                        (unsigned char *) &global_buffer[((y - starty) * totalx + (pch.startx - startx)) * 3]);
                }
                LeaveCriticalSection (&MyMutex3);
            }
        }
    }
Focus on Application Logic, not Thread Management
Example: Win32 Threads vs. TBB for a 2D Ray Tracing Application
Thread Setup and Initialization

    #include "tbb/task_scheduler_init.h"
    #include "tbb/spin_mutex.h"

    tbb::task_scheduler_init init;
    tbb::spin_mutex MyMutex, MyMutex2;

Parallel Task Scheduling and Execution

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    class parallel_task {
    public:
        void operator() (const tbb::blocked_range2d<int> &r) const {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y)
                for (int x = r.cols().begin(); x != r.cols().end(); x++)
                    render_one_pixel (x, y);
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                tbb::spin_mutex::scoped_lock lock (MyMutex2);
                for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                    GraphicsDrawRow (startx - 1, y - 1, totalx,
                        (unsigned char *) &global_buffer[(y - starty) * totalx * 3]);
                }
            }
        }
        parallel_task () {}
    };

    parallel_for (tbb::blocked_range2d<int> (starty, stopy + 1, grain_size,
                                             startx, stopx + 1, grain_size),
                  parallel_task ());
Windows Threads vs. Intel® Threading Building Blocks
Intel® Threading Building Blocks offers platform portability on Windows*, Linux*, and Mac OS* through its cross-platform API. This code comparison shows the additional code needed to make a 2D ray tracing program, Tachyon, correctly threaded. This allows the application to take advantage of current and future multi-core hardware.
This example includes software developed by John E. Stone.
Focus on the work to do, not on “how” to manage threads.

Just say “NO” to explicit thread management.
Not because you can’t do it, but because it isn’t a good use of time developing and maintaining it.
Summary
• If native threads are the “assembly approach” to threading, then Intel® TBB is the “high-level approach”
– you can do more with asm, but do you really want to bother?
• Intel® Threading Building Blocks is a C++ library
– no language extensions required
– works with any C++ compiler
– portable between operating systems
– also available as open source: http://www.threadingbuildingblocks.org/
Flexible, scalable solution with high amount of control at minimum overhead
Multithreading introduces new problems
• A new class of problems is introduced by the interaction between threads; these problems are complicated, non-deterministic, and therefore hard to find!
• Correctness problems (data races)
• Performance problems (contention)
• Runtime problems
Prominent problem: Race Condition
• Suppose global variables A = 1, B = 2
• End result differs depending on whether:
– T1 runs before T2
– T2 runs before T1
• Execution order is not guaranteed unless synchronization methods are used.
Deadlock Example
Deadlock: both threads are now waiting for each other. Thread 1 waits for lock B to be released; Thread 2 waits for lock A to be released.

To fix: both functions must acquire and release the locks in the same order.

    Func1() {
        Lock(A);  globalX++;
        Lock(B);  globalY++;
        unlock(B);
        unlock(A);
    }

    Func2() {
        Lock(B);  globalY++;
        Lock(A);  globalX++;
        unlock(A);
        unlock(B);
    }
Thread Stalls
• A thread waits an inordinate amount of time
– usually on a resource
– commonly caused by dangling locks
• Be sure threads release all locks they hold
What’s Wrong?
    int data;
    DWORD WINAPI threadFunc(LPVOID arg)
    {
        int localData;
        EnterCriticalSection(&lock);
        if (data == DONE_FLAG)
            return 1;               // lock never released!
        localData = data;
        LeaveCriticalSection(&lock);
        process(localData);
        return 0;
    }

Answer: the lock is never released on the early-return path.
Intel® Parallel Studio
Solves 4 stages of parallel development

DESIGN: Gain insight on where parallelism will most benefit existing source code
CODE & DEBUG: Develop effective applications with a C/C++ compiler and comprehensive threaded libraries
VERIFY: Ensure application reliability with proactive parallel memory and threading error checking
TUNE: Enhance applications with an easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Debugger Extension
Enhance Microsoft Visual Studio* with OpenMP* parallel debugging capabilities
Intel® Parallel Debugger Extension
• Thread data sharing event detection
– Break on thread shared-data access (read/write)
• Function re-entrancy detection
• Serialize OpenMP* parallel regions on the fly
• OpenMP* structure view
– Insight into thread groups, barriers, locks, wait lists, etc.
• SIMD SSE registers window
Common Problems with Parallel Code
• Data sharing violation
– a thread is polluting the data of another thread
– a thread accesses data another thread has modified
• Function re-entrancy
– a thread enters a function being executed by another thread
– variable and register content thrashing
– ensure a thread-safe function prolog/epilog
• Lack of visibility into active thread structures
– which threads and which thread directives are currently active?
– difficulty determining the reasons for runtime issues
• Lack of visibility into parallel data handling
– how is the data processed in parallel manipulated in memory and registers?
– difficulty determining the root cause of incorrect data handling
Intel® Parallel Debugger Extension Concepts
[Architecture diagram: Visual Studio* (GUI plus GUI extension, debug engine plus debugger extension) communicating with the debug runtime and the compiler-instrumented application; memory-access instrumentation feeds runtime events into the normal debugging flow]

• Debug info instrumented by the Intel® C++ Compiler: /Qopenmp /debug:parallel
• The debug runtime library triggers debug exceptions: libiomp5md.dll, pdbx.dll
• The Parallel Debugger Extension provides parallelism views and user interactivity

Part of Intel® Parallel Composer, for parallel code using OpenMP*
Data Sharing Detection
Enable Thread Data Sharing Detection
Check Shared Data Events for possible problems
Shared Data Events Detection - Filtering
• Data filter
– specific data items and variables can be excluded
• Code filter
– functions can be excluded
– source files can be excluded
– address ranges can be excluded
Data sharing detection is selective
Re-Entrant Call Detection
Enable Reentrant Call Detection
Check if a reentrant function call may cause problems.
Automatically halts execution when a function is executed by more than one thread at any given point in time.
Re-Entrant Call Detection
• After execution halts and the user clicks OK, the program counter points exactly to where the problem occurred.
OpenMP* Information Windows
Tasks, Spawn Tree, Locks, Barriers, Teams, Taskwaits
SIMD SSE Debugging Window
• SSE Registers Window – displays variables used for SIMD operations
• Free Formatting – flexible representation of data
• In-Place Edit
• In-depth insight into data parallelization and vectorization
Serialize Parallel Regions
• Problem: Parallel loop computes a wrong result. Is it a concurrency or algorithm issue?
• Parallel Debug Support
• Runtime access to the OpenMP num_threads property
• Set to 1 for serial execution of the next parallel block
• User Benefit
• Verification of an algorithm “on the fly” without slowing the entire application down to serial execution
• On-demand serial debugging without recompile/restart
…
Disable parallel
…
Enable parallel
Intel® Parallel Studio
Solves 4 stages of parallel development
DESIGNGain insight on where parallelism will most benefit existing source code
CODE & DEBUGDevelop effective applications with aC/C++ compiler and comprehensive threaded libraries
VERIFYEnsure application reliability with proactive parallel memory and threading error checking
TUNEEnhance applications with easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Inspector
Ensure application reliability with proactive parallel memory and
threading error checking
Race Condition
!! Nondeterministic Errors !!
VERIFY PHASE
The only combined threading and memory checker available today that helps ensure application reliability
• Easily find parallel memory and threading errors with one easy-to-use tool
• Gives both experts and novices greater insight into parallel code behavior
• Helps ensure Windows* application reliability
• Ship apps that run threading-error-free
What we can detect
• Memory Error Detection
– Memory leaks, allocation errors, memory corruption, etc.
• Threaded Error Detection
– All potential data race errors, if the code path is taken (!)
– Deadlocks
• Threaded Application Diagnosis
– Finds latent (or likely-to-occur) errors and maps them to the source-code line, call stack, and memory reference
– Displays useful warnings for effective diagnosis, highlighting potentially severe errors
• Works on standard debug builds
Intel® Parallel Inspector Catches location of memory errors
Reduce observations to your code
Memory leak identified with
location and size
Observations help identify problem
Powerful Error Checking Analysis Types
Memory and Threading Errors
Memory: Locate a large variety of memory and resource problems including memory leaks, buffer overrun errors and pointer problems
Threading: Detect and predict thread-related deadlocks, data races and other synchronization problems
Control the depth of analysis vs. collection time
Reduce information by suppressing irrelevant results
Intel® Parallel Inspector Navigates to the memory allocation point
Quickly identifies memory allocation address
How did it get here?
Intel® Parallel Inspector Identifies threading errors
Threading problem identified with location and status
Quick reference to the source lines
Intel® Parallel Inspector enables C and C++ application developers to:
• Detect and predict thread-related deadlocks, data races and other synchronization problems
• Detect potential security issues in parallel applications
• Rapidly sort errors by size, frequency and type to identify and prioritize critical problems
Threading Analysis
Level 1 – The first level of analysis helps determine whether the application has any deadlocks.
Level 2 – The second level detects whether the application has any data races or deadlocks.
Level 3 – The third level finds data races and deadlocks and additionally tries to detect where they occur.
Level 4 – The fourth level tries to find all threading problems by increasing the call-stack depth to 32 and analyzing the problems on the stack.
Intel® Parallel Studio
Solves 4 stages of parallel development
DESIGNGain insight on where parallelism will most benefit existing source code
CODE & DEBUGDevelop effective applications with aC/C++ compiler and comprehensive threaded libraries
VERIFYEnsure application reliability with proactive parallel memory and threading error checking
TUNEEnhance applications with easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Amplifier
In-Depth Look at Threading to Optimize Performance
Challenges – Factors Inhibiting Scalability
• Parallel overhead
– Due to thread creation, scheduling, …
• Synchronization
– Excessive use of shared data
– Contention for the same synchronization object
– Implicit synchronization
• Load imbalance
– Improper distribution of parallel work
• Granularity & Scalability
– Not enough parallel work
Intel® Parallel Amplifier – Intuitive Parallel Performance Analysis
Hot Spot Analysis Where is my app spending time?
Concurrency Analysis Where and When are cores idle?
Locks & Waits Analysis Where are the bad waits?
Source View See the results on your source
Compare Results Quickly see what changed
Intel® Parallel Amplifier – Hotspot Analysis
Allows you to:
• Locate the most time-consuming routines with a single mouse click
• Drill down to the source code
• Inspect the call tree
• Build TachyonStep1 in Release mode, but with symbols (/Zi)
• Analysis shows grid_intersect() as the hotspot
• Browse the call stack for each invocation
• Shows the individual contribution to elapsed time
• But where to introduce parallelism?
• Move up in the Top-down Tree to the highest-level loop
• Double-click to drill down to source
Intel® Parallel Amplifier – Concurrency Analysis
Allows you to:
• Generate concurrency information with a single mouse click
• Detect the utilization level of cores
– locate optimization opportunities
– identify load-imbalance issues
– avoid under-utilization
CPU Utilization
Easy to find load-imbalance issues
Easy to find functions with poor utilization
• There’s room for improvement in our primes2.c example
• Two reasons:
– static scheduling
– expensive wait with “omp critical”
• After correction
Intel® Parallel Amplifier – Locks and Waits Analysis
• Identify the cause of ineffective processor utilization
• Where is my program waiting on sync or I/O?
– Identify locking problems that slow threaded software
– Identify objects limiting parallelism
– Worst case: waiting at a sync object while the system is poorly utilized
How well does the application utilize the cores?
Easily locate problematic locks
• Analysis shows two relevant sync objects
• Highest wait count at “omp critical”
• Replacing “pragma omp critical” with “atomic” drastically reduces the wait count
Cost of synchronization objects – Win32
Starting the timings...
Total time for 1000000 mutexes:     002245 Mticks ~ 627253 ms
Total time for 1000000 crit sects:  000096 Mticks ~  26835 ms
Total time for 1000000 interlocks:  000057 Mticks ~  16117 ms
Total time for 1000000 raw adds:    000007 Mticks ~   2153 ms
• Compare results after analyses, in summary and details
• The detailed overview allows detecting areas of different benefit, e.g. regressions
Intel® Parallel Studio – www.intel.com/go/parallel