Software & Services Group, Developer Products Division. Copyright © 2009, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners. 11/25/2009
Parallel Programming Methods and Tools: Creating Parallel Code
Hubert Haberstock, Technical Consulting Engineer
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2009. Intel Corporation.
http://intel.com/software/products
Intel® Parallel Studio
Intuitive development tools for multicore parallelism

• Microsoft Visual Studio* plug-in
• End-to-end product suite for parallelism
• Forward scaling to manycore

Intel® Parallel Studio includes all three:
• Intel® Parallel Composer
• Intel® Parallel Inspector
• Intel® Parallel Amplifier
Intel® Parallel Composer
Speeds software development incorporating parallelism, with a C/C++ compiler and comprehensive threaded libraries

Simplifies threading for improved developer productivity
– “Think Parallel” and “Code Parallel” without writing the low-level thread management

Best choice because:
• Easier Parallel Implementation
– Included libraries and built-in capabilities reduce the amount of code and increase scalability
• Best Performance on Windows*
– Built-in optimization features and libraries extract the most from multicore processors
• Interoperable
– Supports a variety of threading methods and compilers
– Seamless integration with Microsoft Visual Studio
• Part of Intel® Parallel Studio
– Comprehensive solution to develop, debug, and tune parallel applications
More than a C++ Compiler …
Simplifies threading for improved developer productivity
– Simple concurrent functionality (__task/__taskcomplete)
– Vectorization support for SSE2/SSE3/SSSE3/SSE4
– OpenMP* 3.0
– Intel® Parallel Debugger Extension, a plug-in to Visual Studio*
– Intel® Threading Building Blocks
– C++ lambda function support
– Intel® Integrated Performance Primitives (IPP)
– Parallel build feature: /MP enables use of multiple cores to speed up the build
– Diagnostics to help develop parallel programs (/Qdiag-enable:thread)
– Seamless integration into Microsoft Visual Studio* 2005 and 2008
– Microsoft source and binary compatible
Compatibility with Microsoft
Source and binary compatible
• Full compatibility with the .NET* build environment
• Binary compatibility
• Name mangling and calling convention
• Debug format compatibility
• Mix and match object files or DLLs

Limitations
• No support for .pch files; use .pchi files instead
• No support for attributed programming or managed C++
Intel® Integrated Performance Primitives (IPP) 6.0
Intel® IPP 6.0 Features
• Optimized library for multiple problem domains
– Image processing
– JPEG, JPEG2000
– Image search descriptors (MPEG-7): color layout, edge histogram
– 3D support
– Audio/video
– Codecs: H.264, MPEG-4
– Video enhancement: denoise, deinterlace, demosaic
– Microsoft* RT Audio
– Data compression
– LZO, zlib, gzip, bzip2
• Single- and multi-core processor optimizations and support
– Intel® Atom™ and Intel® Core™ processor family, including Intel® Core™ i7
Intel® IPP Sample for Implicit Parallelism
Executable: ps_ippi.exe, ps_ippm.exe (–Nx switch)
Library: Intel IPP v5.3 (ippiv8-5.3.dll, ippmv8-5.3.dll)
Compiler: Intel® C++ Compiler for IA-32, Version 10.0
CPU: Intel(R) Core(TM) 2 Quad processor 8x2400 MHz, L1=32/32K
OS: Microsoft Windows Server 2003* (Win32)

[Chart: Performance on matrix function ippInvert_ma_32f, size 6x6 — clocks per element (0 to 30) vs. size of matrix array (200 to 5000), with one series each for 1, 2, 3, and 4 threads]
(Auto) Vectorization
Using the SIMD registers: let the compiler parallelize …
Vectorization
• For vectorization, SSE adds eight 128-bit registers, XMM0–XMM7. The 64-bit extensions add another eight, XMM8–XMM15.
• Operations on these registers are an extension to the x86 instruction set.
• The Intel® Compiler can automatically generate these SSEx (Streaming SIMD Extensions) instructions.
Compiler Based Vectorization
/Q[a]x<code1>[,<code2>,...] (Windows), –[a]x<code1> (Linux)
– generate specialized code to run exclusively on processors indicated by <code>

SSE2: Intel® Pentium 4 and compatible Intel processors
SSE3: Intel® Core™ processor family with Streaming SIMD Extensions 3
SSSE3: Intel® Core™2 processor family with SSSE3
SSE4.1: Intel® processors supporting SSE4 Vectorizing Compiler and Media Accelerator instructions
SSE4.2: Can generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel® Core™ i7 processors
SIMD – SSEx support
SSE: 4x floats
SSE2: 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit(!) integer, 2x doubles
SIMD operations
• Example: vector addition, where you add two single-precision, 4-component vectors together.
• Using x86 operations, this requires four floating-point additions with FADD instructions:

    sum_vector[x] = vector1[x] + vector2[x];
    sum_vector[y] = vector1[y] + vector2[y];
    sum_vector[z] = vector1[z] + vector2[z];
    sum_vector[w] = vector1[w] + vector2[w];

• Using SSE registers, a single 128-bit packed-add instruction replaces the four scalar additions:

    movaps xmm0, vector1
    addps  xmm0, vector2
    movaps sum_vector, xmm0
Using SSE3 - Your Task: Convert This…
[Diagram: with scalar code, each add uses only one 32-bit lane of the 128-bit registers (A[0] + B[0] = C[0], then A[1] + B[1] = C[1], ...); the other three 32-bit integer lanes stay unused]

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
…Into This…
[Diagram: packed code fills all four 32-bit lanes of the 128-bit registers, computing A[3..0] + B[3..0] = C[3..0] in one instruction]

    for (i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];

• The processor-targeting switch vectorizes loops of floating-point and scalar operations.
• Usage: Linux* –xSSE3, Windows* /QxSSE3
Auto-vectorization
• Enabled with option /Qx or /Qax
• Watch for the “Loop vectorized” message in the build log
• When will it not work?
– Loops need to be independent for the compiler to auto-vectorize
– Use option /Qvec-report:n (n = 1, 2, 3) for a detailed report on loops that do not vectorize
Example: Vectorization Report

Driver.c
Driver.c(64): (col. 2) remark: loop was not vectorized: contains unvectorizable statement at line 65.
Driver.c(74): (col. 2) remark: LOOP WAS VECTORIZED.
Driver.c(39): (col. 2) remark: LOOP WAS VECTORIZED.
Driver.c(28): (col. 10) remark: loop was not vectorized: statement cannot be vectorized.
Driver.c(30): (col. 3) remark: LOOP WAS VECTORIZED.
Driver.c(12): (col. 2) remark: loop was not vectorized: not inner loop.
Driver.c(14): (col. 14) remark: loop was not vectorized: statement cannot be vectorized.
Driver.c(18): (col. 3) remark: loop was not vectorized: not inner loop.
Driver.c(19): (col. 4) remark: LOOP WAS VECTORIZED.
Multiply.c
Multiply.c(7): (col. 2) remark: loop was not vectorized: not inner loop.
Multiply.c(9): (col. 3) remark: loop was not vectorized: unsupported loop structure.
-out:MatVector.exe
Streaming SIMD Extensions 4.2
SSE4.2: 7 new instructions for
+ faster video/media
+ faster encryption/compression
+ faster XML parsing
+ faster search & pattern matching
Intel® Core™ i7 (codename Nehalem)
• No new data types, use 128-bit operand similar to SSE4.1
• No new 64-bit SIMD instructions
• Functional in both 32 bit and 64 bit modes
The future of SIMD
• Now: SSEx with 128-bit vectors
• Next: AVX with 256-bit vectors (Sandy Bridge)
• Then: LRBni with 512-bit vectors
• …

[Diagram: vector width growing from SSE (128 bit) to AVX (256 bit) to LRBni (512 bit)]
OpenMP*
The easiest parallel model …
What is OpenMP* ?
Portable, open, shared-memory multi-processing specification (API)
– Fortran 77, Fortran 90, C, and C++
– Multi-vendor support, for all popular operating systems
– E.g. compilers from Intel, Microsoft (since VS2005), and GNU/GCC (since 4.2)

• Three components:
– #pragmas (compiler directives), the most important
– API and runtime library
– Environment variables

• Combines serial and parallel code in a single source
– Lightweight compared to a thread class or thread API
– Supports an incremental approach to parallelization
– Ideally no need to change the serial code
Where to find out more about OpenMP*?
[Word cloud of OpenMP constructs: omp_set_lock, #pragma omp parallel for, #pragma omp critical, C$OMP PARALLEL DO, reduction/private/shared/firstprivate clauses, OMP_SET_NUM_THREADS, OMP_SCHEDULE, sections, single, master, ordered, atomic, flush, barrier, threadprivate, and more]

http://www.openmp.org
Current spec is OpenMP 3.0: 326 pages (combined C/C++ and Fortran)
OpenMP idea
• What can we parallelize?
– Code regions: independent code regions
– Loops: divide the iteration space
– Synchronization (!)
• Thread handling?
– Automatic (number of cores)
– API
Programming Model
Fork-join parallelism:
• Master thread spawns a team of threads
• Parallelism is added incrementally: the sequential program evolves into a parallel program

[Diagram: a master thread forking into thread teams at each parallel region and joining again afterwards]
Parallel Loop Model
• Threads are created
• Data is classed as shared or private (here, A is shared)
• Iterations are distributed across threads
• Barrier at the end of the parallel for; threads either spin or sleep between regions

    void* work(float* A) {
    #pragma omp parallel for shared(A) private(i)
        for (i = 1; i <= 12; i++) {
            /* iterations divided among threads */
        }
    }
Parallel Sections
• Independent sections of code can execute concurrently
Serial Parallel
    #pragma omp parallel sections
    {
        #pragma omp section
        phase1();
        #pragma omp section
        phase2();
        #pragma omp section
        phase3();
    }
Data Scope Clauses
• Private
– A new object of the same type is created for each thread in the team (the object is not initialized)
• Shared
– Shares the variable among all threads
• Default
– Specifies the default data scope (private, shared, or none)
• Firstprivate
– The variable is initialized with the value of the original object (superset of private)
Intel® Compiler and OpenMP*
• OpenMP* support
– /Qopenmp
– /Qopenmp_report{0|1|2}
– /Qopenmp_profile
• Compatibility mode with Microsoft/GCC
– Supports mixed compilation even when OpenMP code is present in both parts
– /Qopenmp-lib:[compat|legacy]
• Additional environment variables
– E.g. KMP_AFFINITY
OpenMP and C# ?
• Encapsulate OpenMP in a C++/CLI DLL (VS2005 or later)
• Anonymous delegates (C# 2.0 or later)
• MSDN webcast (B. Marquardt)

C++:
[…]
public ref class ParExec {
public:
    static void For(int iStart, int iEnd, Method<int>^ loopbody) {
        #pragma omp parallel for
        for (int i = iStart; i < iEnd; i++) {
            loopbody(i);
        }
    }
[…]

C#:
[…]
ParExec.For(0, iAnz, delegate(int i) {
    arr[i] = Math.Sqrt(i) + Math.Pow(i, 0.3);
});
[…]
OpenMP* 3.0 Support
– 1st compiler to be fully compliant with OpenMP* 3.0
– http://www.openmp.org/mp-documents/spec30.pdf
– Express parallelism explicitly
– Use pragmas and library calls to describe parallel loops and tasks
– New extensions:
– Tasking for unstructured parallelism
– Enhanced loop scheduling control
– Better support for nested parallelism
OpenMP* 3.0 Tasking Example Postorder Tree Traversal
    void postorder(node *p) {
        if (p->left)
            #pragma omp task
            postorder(p->left);
        if (p->right)
            #pragma omp task
            postorder(p->right);
        #pragma omp taskwait   // wait for descendants
        process(p->data);
    }
    // Parent task suspended until child tasks complete

Task scheduling point: threads may switch to execute other tasks.
Intel® Threading Building Blocks
A kind of “STL for parallel C++ programming”

• You specify tasks (that can run concurrently) instead of threads
– The library maps user-defined logical tasks onto physical threads, efficiently using cache and balancing load
– Full support for nested parallelism
• Targets threading for scalable performance
– Portable across Linux*, Mac OS*, Windows*, and Solaris*
• Emphasizes scalable data-parallel programming
– Loop parallelism tasks are more scalable than a fixed number of separate tasks
• Compatible with other threading packages
– Can be used in concert with native threads and OpenMP*
Threading Building Blocks - Components
Synchronization primitives: atomic operations, various flavors of mutexes (improved)
Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan
Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved)
Task scheduler: with new functionality
Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator
Utilities: tick_count, tbb_thread (new)

(“new”: introduced in the latest version, v2.1)
Thread Setup and Initialization

    CRITICAL_SECTION MyMutex, MyMutex2, MyMutex3;

    int get_num_cpus (void) {
        SYSTEM_INFO si;
        GetSystemInfo (&si);
        return (int) si.dwNumberOfProcessors;
    }

    int nthreads = get_num_cpus ();
    HANDLE *threads = (HANDLE *) alloca (nthreads * sizeof (HANDLE));
    InitializeCriticalSection (&MyMutex);
    InitializeCriticalSection (&MyMutex2);
    InitializeCriticalSection (&MyMutex3);
    for (int i = 0; i < nthreads; i++) {
        DWORD id;
        threads[i] = CreateThread (NULL, 0, parallel_thread, (LPVOID) i, 0, &id);
    }
    for (int i = 0; i < nthreads; i++) {
        WaitForSingleObject (threads[i], INFINITE);
    }
Parallel Task Scheduling and Execution

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
        patch pch;
        struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    void generate_work (patch *pchin)
    {
        int startx, stopx, starty, stopy;
        int xs, ys;
        startx = pchin->startx;  stopx = pchin->stopx;
        starty = pchin->starty;  stopy = pchin->stopy;
        if (((stopx - startx) >= MINPATCH) || ((stopy - starty) >= MINPATCH)) {
            int xpatchsize = (stopx - startx) / DIVFACTOR + 1;
            int ypatchsize = (stopy - starty) / DIVFACTOR + 1;
            for (ys = starty; ys <= stopy; ys += ypatchsize)
                for (xs = startx; xs <= stopx; xs += xpatchsize) {
                    patch pch;
                    pch.startx = xs;  pch.starty = ys;
                    pch.stopx = MIN (xs + xpatchsize - 1, stopx);
                    pch.stopy = MIN (ys + ypatchsize - 1, stopy);
                    generate_work (&pch);
                }
        } else {  /* just trace this patch */
            work_queue_entry_t *q =
                (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
            q->pch.starty = starty;  q->pch.stopy = stopy;
            q->pch.startx = startx;  q->pch.stopx = stopx;
            q->next = NULL;
            if (work_queue_head == NULL) {
                work_queue_head = q;
            } else {
                work_queue_tail->next = q;
            }
            work_queue_tail = q;
        }
    }

    void generate_worklist (void)
    {
        patch pch;
        pch.startx = startx;  pch.stopx = stopx;
        pch.starty = starty;  pch.stopy = stopy;
        generate_work (&pch);
    }

    bool schedule_thread_work (patch &pch)
    {
        EnterCriticalSection (&MyMutex3);
        work_queue_entry_t *q = work_queue_head;
        if (q != NULL) {
            pch = q->pch;
            work_queue_head = work_queue_head->next;
        }
        LeaveCriticalSection (&MyMutex3);
        return (q != NULL);
    }

    generate_worklist ();

    void parallel_thread (void *arg)
    {
        patch pch;
        while (schedule_thread_work (pch)) {
            for (int y = pch.starty; y <= pch.stopy; y++)
                for (int x = pch.startx; x <= pch.stopx; x++)
                    render_one_pixel (x, y);
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                EnterCriticalSection (&MyMutex3);
                for (int y = pch.starty; y <= pch.stopy; y++) {
                    GraphicsDrawRow (pch.startx - 1, y - 1, pch.stopx - pch.startx + 1,
                        (unsigned char *) &global_buffer[((y - starty) * totalx + (pch.startx - startx)) * 3]);
                }
                LeaveCriticalSection (&MyMutex3);
            }
        }
    }
Focus on Application Logic, not Thread Management
Example: Win32 Threads vs. TBB for a 2D Ray Tracing Application
Thread Setup and Initialization

    #include "tbb/task_scheduler_init.h"
    #include "tbb/spin_mutex.h"

    tbb::task_scheduler_init init;
    tbb::spin_mutex MyMutex, MyMutex2;

Parallel Task Scheduling and Execution

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    class parallel_task {
    public:
        void operator() (const tbb::blocked_range2d<int> &r) const {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y)
                for (int x = r.cols().begin(); x != r.cols().end(); x++)
                    render_one_pixel (x, y);
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                tbb::spin_mutex::scoped_lock lock (MyMutex2);
                for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                    GraphicsDrawRow (startx - 1, y - 1, totalx,
                        (unsigned char *) &global_buffer[(y - starty) * totalx * 3]);
                }
            }
        }
        parallel_task () {}
    };

    parallel_for (tbb::blocked_range2d<int> (starty, stopy + 1, grain_size,
                                             startx, stopx + 1, grain_size),
                  parallel_task ());
Windows Threads vs. Intel® Threading Building Blocks
Intel® Threading Building Blocks offers platform portability on Windows*, Linux*, and Mac OS* through its cross-platform API. This code comparison shows the additional code needed to make a 2D ray tracing program, Tachyon, correctly threaded. This allows the application to take advantage of current and future multi-core hardware.
This example includes software developed by John E. Stone.
Focus on the work to do, not on “how” to manage threads.

Just say “NO” to explicit thread management.
Not because you can’t do it, but because it isn’t a good use of time developing and maintaining it.
Summary
• If native threads are the “assembly approach” to threading, then Intel® TBB is the “high-level approach”
– you can do more with asm, but do you really want to bother?
• Intel® Threading Building Blocks is a C++ library
– no language extensions required
– works with any C++ compiler
– portable between operating systems
– also available as open source: http://www.threadingbuildingblocks.org/
Flexible, scalable solution with high amount of control at minimum overhead
Multithreading introduces new problems
• A new class of problems is introduced by the interaction between threads; these problems are complicated, non-deterministic, and therefore hard to find!
• Correctness problems (data races)
• Performance problems (contention)
• Runtime problems
Prominent problem: Race Condition
• Suppose global variables A = 1, B = 2
• End result differs depending on whether:
– T1 runs before T2
– T2 runs before T1
• Execution order is not guaranteed unless synchronization methods are used.
Deadlock Example
Deadlock: both threads are now waiting for each other. Thread 1 waits for lock B to be released; Thread 2 waits for lock A to be released.

To fix: both functions must acquire and release the locks in the same order.

    Func1() {
        Lock(A);  globalX++;
        Lock(B);  globalY++;
        unlock(B);
        unlock(A);
    }

    Func2() {
        Lock(B);  globalY++;
        Lock(A);  globalX++;
        unlock(A);
        unlock(B);
    }
Thread Stalls
• A thread waits an inordinate amount of time
– usually on a resource
– commonly caused by dangling locks
• Be sure threads release all locks they hold
What’s Wrong?
    int data;
    DWORD WINAPI threadFunc(LPVOID arg)
    {
        int localData;
        EnterCriticalSection(&lock);
        if (data == DONE_FLAG)
            return 1;               // lock never released!
        localData = data;
        LeaveCriticalSection(&lock);
        process(localData);
        return 0;
    }

Answer: the lock is never released on the early-return path.
Intel® Parallel Studio
Solves 4 stages of parallel development

DESIGN: Gain insight on where parallelism will most benefit existing source code
CODE & DEBUG: Develop effective applications with a C/C++ compiler and comprehensive threaded libraries
VERIFY: Ensure application reliability with proactive parallel memory and threading error checking
TUNE: Enhance applications with an easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Debugger Extension
Enhance Microsoft Visual Studio* with OpenMP* parallel debugging capabilities
Intel® Parallel Debugger Extension
• Thread data sharing event detection
– Break on thread shared-data access (read/write)
• Function re-entrancy detection
• Serialize OpenMP* parallel regions on the fly
• OpenMP* structure view
– Insight into thread groups, barriers, locks, wait lists, etc.
• SIMD SSE registers window
Common Problems with Parallel Code
• Data sharing violation
– a thread is polluting the data of another thread
– a thread accesses data another thread has modified
• Function re-entrancy
– a thread enters a function being executed by another thread
– variable and register content thrashing
– ensure a thread-safe function prolog/epilog
• Lack of visibility into active thread structures
– which threads and which thread directives are currently active?
– difficulty determining the reasons for runtime issues
• Lack of visibility into parallel data handling
– how is the data processed in parallel manipulated in memory and registers?
– difficulty determining the root cause of incorrect data handling
Intel® Parallel Debugger Extension Concepts
[Architecture diagram: Visual Studio* (GUI plus GUI extension, debug engine plus debugger extension) communicating with the debug runtime and the compiler-instrumented application; memory-access instrumentation feeds runtime events into the normal debugging flow]

• Debug info instrumented by the Intel® C++ Compiler: /Qopenmp /debug:parallel
• The debug runtime library triggers debug exceptions: libiomp5md.dll, pdbx.dll
• The Parallel Debugger Extension provides parallelism views and user interactivity

Part of Intel® Parallel Composer, for parallel code using OpenMP*
Data Sharing Detection
Enable Thread Data Sharing Detection
Check Shared Data Events for possible problems
Shared Data Events Detection - Filtering
• Data filter
– specific data items and variables can be excluded
• Code filter
– functions can be excluded
– source files can be excluded
– address ranges can be excluded
Data sharing detection is selective
Re-Entrant Call Detection
Enable Reentrant Call Detection
Check if a reentrant function call may cause problems.
Automatically halts execution when a function is executed by more than one thread at any given point in time.
Re-Entrant Call Detection
• After execution halts and the user clicks OK, the program counter points exactly to where the problem occurred.
OpenMP* Information Windows
Tasks, Spawn Tree, Locks, Barriers, Teams, Taskwaits
SIMD SSE Debugging Window
• SSE Registers Window – displays variables used for SIMD operations
• Free Formatting – flexible representation of data
• In-Place Edit
• In-depth insight into data parallelization and vectorization
Serialize Parallel Regions
• Problem: Parallel loop computes a wrong result. Is it a concurrency or algorithm issue?
• Parallel Debug Support
• Runtime access to the OpenMP num_threads property
• Set to 1 for serial execution of the next parallel block
• User Benefit
• Verification of an algorithm “on the fly” without slowing the entire application down to serial execution
• On-demand serial debugging without recompile/restart
…
Disable parallel
…
Enable parallel
Intel® Parallel Studio
Solves 4 stages of parallel development
DESIGNGain insight on where parallelism will most benefit existing source code
CODE & DEBUGDevelop effective applications with aC/C++ compiler and comprehensive threaded libraries
VERIFYEnsure application reliability with proactive parallel memory and threading error checking
TUNEEnhance applications with easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Inspector
Ensure application reliability with proactive parallel memory and
threading error checking
Race Condition
!! Nondeterministic Errors !!
VERIFY PHASE
The only combined threading and memory checker available today that helps ensure application reliability
• Easily find parallel memory and threading errors with one easy-to-use tool
• Gives both experts and novices greater insight into parallel code behavior
• Helps ensure Windows* application reliability
• Ship apps that run threading-error-free
What we can detect
• Memory Error Detection
– Memory leaks, allocation errors, memory corruption, etc.
• Threaded Error Detection
– All potential data race errors, if the code path is taken (!)
– Deadlocks
• Threaded Application Diagnosis
– Finds latent (or likely-to-occur) errors and maps them to the source-code line, call stack, and memory reference
– Displays useful warnings for effective diagnosis, highlighting potentially severe errors
• Works on standard debug builds
Intel® Parallel Inspector Catches location of memory errors
Reduce observations to your code
Memory leak identified with
location and size
Observations help identify problem
Powerful Error Checking Analysis Types
Memory and Threading Errors
Memory: Locate a large variety of memory and resource problems including memory leaks, buffer overrun errors and pointer problems
Threading: Detect and predict thread-related deadlocks, data races and other synchronization problems
Control the depth of analysis vs. collection time
Reduce information by suppressing irrelevant results
Intel® Parallel Inspector Navigates to the memory allocation point
Quickly identifies memory allocation address
How did it get here?
Intel® Parallel Inspector Identifies threading errors
Threading problem identified with location and status
Quick reference to the source lines
Intel® Parallel Inspector enables C and C++ application developers to:
• Detect and predict thread-related deadlocks, data races and other synchronization problems
• Detect potential security issues in parallel applications
• Rapidly sort errors by size, frequency and type to identify and prioritize critical problems
Threading Analysis
Level 1 – The first level of analysis helps determine whether the application has any deadlocks.
Level 2 – The second level detects whether the application has any data races or deadlocks.
Level 3 – The third level finds data races and deadlocks and additionally tries to detect where they occur.
Level 4 – The fourth level tries to find all threading problems by increasing the call-stack depth to 32 and analyzing the problems on the stack.
Intel® Parallel Studio
Solves 4 stages of parallel development
DESIGNGain insight on where parallelism will most benefit existing source code
CODE & DEBUGDevelop effective applications with aC/C++ compiler and comprehensive threaded libraries
VERIFYEnsure application reliability with proactive parallel memory and threading error checking
TUNEEnhance applications with easy-to-use performance analyzer and tuner
Intel® Parallel Advisor Lite – a new plug-in for Visual Studio - available at whatif.intel.com
Intel® Parallel Amplifier
In-Depth Look at Threading to Optimize Performance
Challenges – Factors Inhibiting Scalability
• Parallel overhead
– Due to thread creation, scheduling, …
• Synchronization
– Excessive use of shared data
– Contention for the same synchronization object
– Implicit synchronization
• Load imbalance
– Improper distribution of parallel work
• Granularity & Scalability
– Not enough parallel work
Intel® Parallel Amplifier – Intuitive Parallel Performance Analysis
Hot Spot Analysis Where is my app spending time?
Concurrency Analysis Where and When are cores idle?
Locks & Waits Analysis Where are the bad waits?
Source View See the results on your source
Compare Results Quickly see what changed
Intel® Parallel Amplifier – Hotspot Analysis
Allows you to:
• Locate the most time-consuming routines with a single mouse click
• Drill down to the source code
• Inspect the call tree
• Build TachyonStep1 in Release mode, but with symbols (/Zi)
• Analysis shows grid_intersect() as the hotspot
• Browse the call stack for each invocation
• Shows the individual contribution to elapsed time
• But where to introduce parallelism?
• Move up in the Top-down Tree to the highest-level loop
• Double-click to drill down to source
Intel® Parallel Amplifier – Concurrency Analysis
Allows you to:
• Generate concurrency information with a single mouse click
• Detect the utilization level of cores
– locate optimization opportunities
– identify load-imbalance issues
– avoid under-utilization
CPU Utilization
Easy to find load-imbalance issues
Easy to find functions with poor utilization
• There’s room for improvement in our primes2.c example
• Two reasons:
– static scheduling
– expensive wait with “omp critical”
• After correction
Intel® Parallel Amplifier – Locks and Waits Analysis
• Identify the cause of ineffective processor utilization
• Where is my program waiting on sync or I/O?
– Identify locking problems that slow threaded software
– Identify objects limiting parallelism
– Worst case: waiting at a sync object while the system is poorly utilized
How well does the application utilize the cores?
Easily locate problematic locks
• Analysis shows two relevant sync objects
• Highest wait count at “omp critical”
• Replacing “pragma omp critical” with “atomic” drastically reduces the wait count
Cost of synchronization objects – Win32
Starting the timings...
Total time for 1000000 mutexes:     002245 Mticks ~ 627253 ms
Total time for 1000000 crit sects:  000096 Mticks ~  26835 ms
Total time for 1000000 interlocks:  000057 Mticks ~  16117 ms
Total time for 1000000 raw adds:    000007 Mticks ~   2153 ms
• Compare results after analyses, in summary and details
• The detailed overview allows detecting areas of different benefit, e.g. regressions
Intel® Parallel Studio – www.intel.com/go/parallel