Transcript of "Exploiting Parallelism with Multi-core Technologies"
James Reinders
Date: Thursday, July 26
Time: 2:35pm - 3:20pm
Location: E142

Page 1: Title slide.

Page 2

Coding with TBB Contest
- threadingbuildingblocks.org
- Challenge: Integrate TBB into open source projects through 8/31/07
- Judging: Best implementation and most benefit achieved from using TBB
- Grand Prize: A multi-core laptop and recognition at the upcoming Intel Developer Forum
- Details: threadingbuildingblocks.org

Page 3

Problem
- Gaining performance from multi-core requires parallel programming
- Even a simple "parallel for" is tricky for a non-expert to write well with threads
- Two aspects to parallel programming
  - Correctness: avoiding race conditions and deadlock
  - Performance: efficient use of resources
    - Hardware threads (keep busy with real work)
    - Memory space (share between cores)
    - Memory bandwidth (efficient cache line use)

Page 4

Three Approaches for Improvement
- New language: Cilk, NESL, Haskell, Erlang, Fortress, …
- Language extensions / pragmas: OpenMP
  - Easier to get acceptance, but requires a special compiler
- Library: POOMA, Hood, …
  - Easy to use
  - Domain specific

Page 5

Family Tree (figure): a timeline, 1988-2006, of the languages, pragmas, and libraries that fed into Intel® Threading Building Blocks.
- Languages: Cilk (space-efficient scheduler; cache-oblivious algorithms), Threaded-C (continuation tasks; task stealing)
- Pragmas: OpenMP* (fork/join tasks), OpenMP taskqueue (while & recursion)
- Libraries: Chare Kernel (small tasks), JSR-166 (FJTask) (containers), STL (generic programming), STAPL (recursive ranges), ECMA .NET* (parallel iteration classes)

*Other names and brands may be claimed as the property of others

Page 6

Enter Intel® Threading Building Blocks
- 2 years from conception to 1.0 beta
- Born from a meeting of Principal Engineers who wanted to:
  - Ease parallel programming for variable numbers of cores
  - Build an accessible technology to encourage adoption
  - Take advantage of existing research to meld ease of conception with performance
- Version 1.0 launched August 2006
- Version 1.1 launched April 2007
- Version 2.0: Intel provides Threading Building Blocks as an open source project, with more ports done and more we can work on together. threadingbuildingblocks.org

Page 7

Example (just a peek)

void ApplyFoo( size_t n, int x ) {
    for( size_t i=0; i<n; ++i )
        Foo(i,x);
}

SERIAL VERSION

void ParallelApplyFoo( size_t n, int x ) {
    parallel_for( blocked_range<size_t>(0,n,10),
        <>( const blocked_range<size_t>& range ) {
            for( size_t i=range.begin(); i<range.end(); ++i )
                Foo(i,x);
        }
    );
}

PARALLEL VERSION (the way I wish I could write it)

Page 8

Parallel Version (as it can be written)

class ApplyFoo {
public:
    int my_x;
    ApplyFoo( int x ) : my_x(x) {}
    void operator()( const blocked_range<size_t>& range ) const {
        for( size_t i=range.begin(); i!=range.end(); ++i )
            Foo(i,my_x);
    }
};

void ParallelApplyFoo( size_t n, int x ) {
    parallel_for( blocked_range<size_t>(0,n,10), ApplyFoo(x) );
}

Page 9

Underlying concepts

Page 10

Generic Programming
- Best known example: the C++ Standard Template Library
- Write the best possible algorithm in the most general way
  - parallel_for does not require a specific type of iteration space, only that it have signatures for recursive splitting
- Instantiate the algorithm for a specific situation
- Enables distribution of broadly useful, high-quality algorithms and data structures

Page 11

Key Features of Intel Threading Building Blocks
- Manage work as divisible tasks instead of threads
  - Intel TBB maps such tasks onto physical threads, solving cache issues and load balancing for us
  - Full support for nested parallelism
- Targets threading for robust performance
  - Portable, scalable performance for computationally intense portions
  - Interoperable with other threading packages
  - Emphasizes scalable, data-parallel programming

Page 12

Relaxed Sequential Semantics
- TBB emphasizes relaxed sequential semantics
- Parallelism is an accelerator, not mandatory for correctness

Page 13

Generic Parallel Algorithms:
parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan

Concurrent Containers:
concurrent_hash_map, concurrent_queue, concurrent_vector

Synchronization Primitives:
atomic, spin_mutex, spin_rw_mutex, queuing_mutex, queuing_rw_mutex, mutex

Memory Allocation:
cache_aligned_allocator, scalable_allocator

Task scheduler

Page 14

Serial Example

static void SerialUpdateVelocity() {
    for( int i=1; i<UniverseHeight-1; ++i )
        for( int j=1; j<UniverseWidth-1; ++j )
            V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
}

Page 15

Parallel Version
(In the slide's color coding: blue = original code, red = provided by TBB, black = boilerplate for the library.)

// Task: the subdivision handler
struct UpdateVelocityBody {
    void operator()( const blocked_range<int>& range ) const {
        int end = range.end();
        for( int i=range.begin(); i<end; ++i ) {
            for( int j=1; j<UniverseWidth-1; ++j ) {
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
            }
        }
    }
};

// Parallel control structure
void ParallelUpdateVelocity() {
    parallel_for( blocked_range<int>( 1, UniverseHeight-1 ),
                  UpdateVelocityBody(), auto_partitioner() );
}

Page 16

Range is Generic
- Requirements for a parallel_for Range (see the table and sketch below)
- Library provides blocked_range and blocked_range2d
- You can define your own ranges
- Partitioner calls the splitting constructor to spread tasks over the range

R::R( const R& )              Copy constructor
R::~R()                       Destructor
bool R::empty() const         True if range is empty
bool R::is_divisible() const  True if range can be partitioned
R::R( R& r, split )           Splits r into two subranges
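A minimal sketch of a user-defined range satisfying these requirements; the IntRange name and its details are ours for illustration, not from the talk:

#include "tbb/parallel_for.h"        // brings in tbb::split

class IntRange {
    int my_begin, my_end;
public:
    IntRange( int lo, int hi ) : my_begin(lo), my_end(hi) {}
    // Splitting constructor: takes the right half of r, shrinks r to the left half.
    IntRange( IntRange& r, tbb::split )
        : my_begin( (r.my_begin + r.my_end)/2 ), my_end( r.my_end ) {
        r.my_end = my_begin;
    }
    bool empty() const { return my_begin==my_end; }
    bool is_divisible() const { return my_end-my_begin > 1; }
    int begin() const { return my_begin; }
    int end() const { return my_end; }
};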

Page 17

parallel_reduce

parallel_scan

parallel_while

parallel_sort

pipeline
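To give a flavor of these generic algorithms, here is a minimal parallel_reduce sketch. It is not from the slides: the input array a, the SumBody name, and the grainsize of 1000 are assumptions for illustration.

#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
using namespace tbb;

extern float a[];                            // assumed input array

struct SumBody {
    float sum;
    SumBody() : sum(0) {}
    SumBody( SumBody&, split ) : sum(0) {}   // splitting constructor: fresh accumulator
    void operator()( const blocked_range<size_t>& r ) {
        for( size_t i=r.begin(); i!=r.end(); ++i )
            sum += a[i];
    }
    void join( SumBody& rhs ) { sum += rhs.sum; }  // fold the stolen half back in
};

float ParallelSum( size_t n ) {
    SumBody body;
    parallel_reduce( blocked_range<size_t>(0,n,1000), body );
    return body.sum;
}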

Page 18

Parallel pipeline
- Linear pipeline of stages
  - You specify the maximum number of items that can be in flight
  - Handling an arbitrary DAG can take thought, but is doable
- Each stage can be serial or parallel
  - Serial stage = one item at a time, in order
  - Parallel stage = multiple items at a time, out of order
- Uses cache efficiently
  - Each thread carries an item through as many stages as possible
  - Biases toward finishing old items before tackling new ones
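A minimal sketch of such a pipeline, assuming the tbb::pipeline and tbb::filter interface of this TBB generation (the filter constructor takes an is_serial flag); the helpers next_item, transform, and write are hypothetical stand-ins for real stage work:

#include "tbb/pipeline.h"

extern void* next_item();            // hypothetical input source; NULL means end of input
extern void  transform( void* );     // hypothetical per-item work
extern void  write( void* );         // hypothetical output sink

class InputFilter : public tbb::filter {
public:
    InputFilter() : tbb::filter( /*is_serial=*/true ) {}
    void* operator()( void* ) { return next_item(); }   // returning NULL stops the pipeline
};

class TransformFilter : public tbb::filter {
public:
    TransformFilter() : tbb::filter( /*is_serial=*/false ) {}  // parallel stage
    void* operator()( void* item ) { transform(item); return item; }
};

class OutputFilter : public tbb::filter {
public:
    OutputFilter() : tbb::filter( /*is_serial=*/true ) {}
    void* operator()( void* item ) { write(item); return NULL; }
};

void RunPipeline() {
    tbb::pipeline p;
    InputFilter in; TransformFilter xf; OutputFilter out;
    p.add_filter( in );
    p.add_filter( xf );
    p.add_filter( out );
    p.run( /*maximum items in flight=*/8 );
    p.clear();
}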

Page 19

Parallel pipeline (figure): numbered items flow through a parallel stage and two serial stages.
- Tag incoming items with sequence numbers.
- The parallel stage scales because it can process items in parallel or out of order.
- A serial stage processes items one at a time, in order; items wait their turn to enter a serial stage.
- Sequence numbers are used to recover order for a serial stage.
- Limiting the total number of items flowing through the pipeline controls excessive parallelism.
- Throughput is limited by the throughput of the slowest serial stage.

Page 20

Concurrent Containers

Mutual Exclusion

Memory Allocator

Task Scheduler

Page 21

Concurrent Containers
- Intel TBB provides concurrent containers
  - STL containers are not safe under concurrent operations: attempting multiple modifications concurrently could corrupt them
  - Standard practice: wrap a lock around STL container accesses, which limits accessors to operating one at a time
- TBB provides fine-grained locking and lockless operations where possible
  - Worse single-thread performance, but better scalability
  - Can be used with TBB, OpenMP, or native threads

Page 22

Concurrent interface requirements
- Some STL interfaces are not designed to support concurrency
- For example, suppose two threads each execute:

extern std::queue<int> q;    // element type added for illustration

if( !q.empty() ) {
    // At this instant, another thread might pop the last element.
    int item = q.front();
    q.pop();
}

- Solution: Intel TBB concurrent_queue has pop_if_present(), as sketched below
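A minimal sketch of the fixed idiom; the int element type and Process() are illustrative stand-ins:

#include "tbb/concurrent_queue.h"

extern void Process( int );          // hypothetical consumer

tbb::concurrent_queue<int> q;

void Consume() {
    int item;
    // pop_if_present() folds the empty-test and the pop into one atomic
    // operation, closing the race shown above.
    if( q.pop_if_present(item) )
        Process(item);
}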

Page 23

concurrent_vector<T>
- Dynamically growable array of T
  - grow_by(n)
  - grow_to_at_least(n)
- Never moves elements until cleared
- Can concurrently access and grow
- Method clear() is not thread-safe with respect to access/resizing

Example:

// Append sequence [begin,end) to x in a thread-safe way.
template<typename T>
void Append( concurrent_vector<T>& x, const T* begin, const T* end ) {
    std::copy( begin, end, x.begin() + x.grow_by(end-begin) );
}

Page 24

concurrent_queue<T>
- Preserves local element order
  - If one thread pushes two values and another thread pops those two values, they come out in the same order they went in
- Two kinds of pops
  - Blocking
  - Non-blocking
- Method size() returns a signed integer
  - If size() returns -n, it means n pops await corresponding pushes
- Caution: queues are cache coolers
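A minimal producer/consumer sketch against this interface, assuming that in this generation of TBB the blocking pop is concurrent_queue::pop() (later releases reorganized this API); Use() and the counts are illustrative:

#include "tbb/concurrent_queue.h"

extern void Use( int );              // hypothetical sink

tbb::concurrent_queue<int> q;

void Producer() {
    for( int i=0; i<100; ++i )
        q.push(i);                   // items from one thread keep their order
}

void Consumer() {
    for( int i=0; i<100; ++i ) {
        int v;
        q.pop(v);                    // blocking pop: waits until an item arrives
        Use(v);
    }
}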

Page 25

concurrent_hash_map<Key,T,HashCompare>
- Associative table that allows concurrent access for reads and updates
  - bool insert( accessor &result, const Key &key ) to add or edit
  - bool find( accessor &result, const Key &key ) to edit
  - bool find( const_accessor &result, const Key &key ) to look up
  - bool erase( const Key &key ) to remove
- Reader locks coexist; writer locks are exclusive

Page 26

Example: map strings to integers

struct MyHashCompare {
    static long hash( const char* x ) {
        long h = 0;
        for( const char* s = x; *s; ++s )
            h = (h*157)^*s;
        return h;
    }
    static bool equal( const char* x, const char* y ) {
        return strcmp(x,y)==0;
    }
};

typedef concurrent_hash_map<const char*,int,MyHashCompare> StringTable;

StringTable MyTable;

void MyUpdateCount( const char* x ) {
    StringTable::accessor a;   // accessor acts as a smart pointer and writer lock
    MyTable.insert( a, x );    // adds the entry if not already present
    a->second += 1;
}

Multiple threads can insert and update entries concurrently.
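A minimal read-side companion to the example above (GetCount is our name, not from the slides); a const_accessor takes a reader lock, so concurrent readers coexist:

int GetCount( const char* x ) {
    StringTable::const_accessor a;   // reader lock
    if( MyTable.find( a, x ) )
        return a->second;
    return 0;                        // key not present
}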

Page 27

Synchronization Primitives
- Shared data: use mutual exclusion to avoid races
- TBB mutual-exclusion regions are protected by scoped locks
  - The extent of the lock is determined by its lifetime (scope)
  - Leaving the lock's scope calls the destructor, making it exception safe
  - Minimizing lock lifetime reduces possible contention
- Several mutex behaviors are available
  - Spin vs. queued
  - Writer vs. reader/writer
  - Scoped wrapper of native mutual-exclusion functions
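A minimal scoped-lock sketch (CountMutex, Count, and IncrementCount are illustrative names, not from the slides):

#include "tbb/spin_mutex.h"

tbb::spin_mutex CountMutex;
long Count = 0;

void IncrementCount() {
    tbb::spin_mutex::scoped_lock lock( CountMutex );  // acquires the mutex
    ++Count;
}   // 'lock' destructor releases the mutex, even if an exception is thrown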

Page 28

Example: spin_rw_mutex promotion

spin_rw_mutex MyMutex;

int foo() {
    spin_rw_mutex::scoped_lock lock( MyMutex, /*is_writer=*/false );
    …
    if( !lock.upgrade_to_writer() )
        { /* lock was released in between; reacquire state */ }
    /* perform desired write */
    return 0;
    /* destructor of 'lock' releases 'MyMutex' */
}

- Exceptions occurring within the locked code range automatically release the lock (the lock passes out of scope), avoiding deadlock
- Any reader lock may be upgraded to a writer lock; upgrade_to_writer() fails (returns false) if the lock had to be released before it could be relocked for writing

Page 29

Scalable Memory Allocator
- Problem
  - Memory allocation can bottleneck a concurrent environment
  - Thread allocation from a global heap requires global locks
- Solution
  - Intel® Threading Building Blocks provides a tested, tuned, scalable, per-thread memory allocator
- The scalable memory allocator interface can be used:
  - As an allocator argument to STL template classes
  - As a replacement for malloc/realloc/free calls (C programs)
  - As a replacement for the new and delete operators (C++ programs)
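A minimal sketch of the STL-allocator usage (the typedef name is ours):

#include <vector>
#include "tbb/scalable_allocator.h"

// A vector whose storage comes from the scalable, per-thread allocator,
// avoiding the global heap lock on growth.
typedef std::vector<double, tbb::scalable_allocator<double> > ScalableVector;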

Page 30

Task Scheduler
- Underlying the generic task structure is a task scheduler
- Core to task scheduling is a thread pool, through which Intel TBB maximizes thread efficiency and manages its complexity
- The task scheduler is designed to address common performance issues of parallel programming with native threads

Problem              TBB Approach
Oversubscription     One scheduler thread per hardware thread
"Fair" scheduling    "Greedy" scheduling often wins
Program complexity   Programmer specifies tasks, not threads
Load imbalance       Task chunking and work stealing help balance load

Page 31

Task Scheduler
- Intel TBB task interest is managed in the task_scheduler_init object
  - Thread pool construction is also tied to the life of this object
  - Nested construction is reference counted, with low overhead
  - Put the init object's scope high to avoid pool reconstruction overhead
- Construction specifies the thread pool size: automatic, explicit, or deferred
- Dynamic init-object lifetime management offers thread pool size control (see the deferred sketch below)

#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main() {
    task_scheduler_init init;
    …
    return 0;
}
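A minimal sketch of the deferred/explicit options named above (the pool size of 4 is an arbitrary illustration):

#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main() {
    // Defer pool construction until the size is known.
    task_scheduler_init init( task_scheduler_init::deferred );
    init.initialize( 4 );            // explicit pool size
    // ... parallel work ...
    return 0;                        // destructor of 'init' tears down the pool
}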

Page 32

Tasking Development tools
- Intel TBB offers facilities to accelerate development
  - Linking with libtbb_debug.so (or its Windows/Mac equivalents) adds checking
  - The TBB_DO_ASSERT macro extends checking into the header/inline code
  - TBB_DO_THREADING_TOOLS adds hooks for the Intel Thread Analysis tools
- The tick_count class offers convenient timing services
  - tick_count::now() returns the current timestamp
  - tick_count::interval_t operator-( const tick_count &t1, const tick_count &t2 ) subtracts timestamps
  - double tick_count::interval_t::seconds() converts intervals to real time
- ITT_NOTIFY events can be useful with the Intel Thread Analysis tools
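A minimal timing sketch against the tick_count interface above; DoWork is a hypothetical workload:

#include <cstdio>
#include "tbb/tick_count.h"

extern void DoWork();                // hypothetical workload

void TimeDoWork() {
    tbb::tick_count t0 = tbb::tick_count::now();
    DoWork();
    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf( "DoWork took %g seconds\n", (t1-t0).seconds() );
}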

Page 33

open source tour

Page 34

Open Source – quick 'tour'

- Source library organized around 4 directories

- src – C++ source for Intel TBB, TBBmalloc and the unit tests

- include – the standard include files

- build – catchall for platform-specific build information

- examples – TBB sample code

- Top level index.html offers help on building and porting

- Build prerequisites:

- C++ compiler for target environment

- GNU make

- Bourne or BASH-compatible shell

- Some architectures may require an assembler for low-level primitives

Page 35

Coding with TBB Contest
- Challenge: Integrate TBB into open source projects through 8/31/07
- Judging: Best implementation and most benefit achieved from using TBB
- Grand Prize: A multi-core laptop and recognition at the upcoming Intel Developer Forum
- Details: threadingbuildingblocks.org

Page 36

Learn more…
- Intel booth (today until 5pm!) – many Intel engineers, come see us!

- threadingbuildingblocks.org