PARALLEL - uni-hamburg.de
Joachim
Nitschke
PARALLEL
PROGRAMMING
Project Seminar “Parallel Programming”, Summer Semester 2011
Introduction
Parallel program design
Patterns for parallel programming
A: Algorithm structure
B: Supporting structures
CONTENT
2
Context
around
parallel
programming
INTRODUCTION
Many different models exist, reflecting the various
parallel hardware architectures
The 2 (or rather 3) most common models:
Shared memory
Distributed memory
Hybrid models (combining shared and distributed memory)
PARALLEL PROGRAMMING MODELS
4
Figure: Shared memory vs. distributed memory architectures
PARALLEL PROGRAMMING MODELS
5
Shared memory:
Synchronize memory access
Locking vs. potential race conditions
Distributed memory:
Communication bandwidth and resulting latency
Manage message passing
Synchronous vs. asynchronous communication
PROGRAMMING CHALLENGES
6
2 common standards as examples for the 2
parallel programming models:
Open Multi-Processing (OpenMP)
Message Passing Interface (MPI)
PARALLEL PROGRAMMING STANDARDS
7
Collection of libraries and compiler directives for parallel programming on shared memory computers
Programmers have to explicitly designate blocks that are to run in parallel by adding directives such as #pragma omp parallel
OpenMP then creates a number of threads executing the designated code block
OpenMP
8
Library with routines to manage message passing
for programming on distributed memory computers
Messages are sent from one process to another
Routines for synchronization, broadcasts, blocking
and non-blocking communication
MPI
9
Figure: distributing data with MPI.Scatter and collecting the results with MPI.Gather
MPI EXAMPLE
10
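The scatter/gather pattern from the figure can be sketched in plain Python. This is a conceptual illustration only, not real MPI code: an actual program would call MPI_Scatter/MPI_Gather (or comm.scatter/comm.gather in mpi4py), and the thread pool here merely stands in for MPI ranks.

```python
# Conceptual sketch of scatter/gather -- threads stand in for MPI ranks.
from concurrent.futures import ThreadPoolExecutor

def scatter(data, n_ranks):
    """Split data into one chunk per rank, as MPI.Scatter would."""
    size = (len(data) + n_ranks - 1) // n_ranks
    return [data[i * size:(i + 1) * size] for i in range(n_ranks)]

def local_work(chunk):
    """The computation each rank performs on its local chunk."""
    return sum(x * x for x in chunk)

def scatter_gather(data, n_ranks=4):
    chunks = scatter(data, n_ranks)                   # scatter phase
    with ThreadPoolExecutor(max_workers=n_ranks) as ex:
        partials = list(ex.map(local_work, chunks))   # per-rank work
    return sum(partials)                              # gather + reduce

print(scatter_gather(list(range(10))))  # prints 285 (sum of squares 0..9)
```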
General
strategies
for finding
concurrency
PARALLEL PROGRAM
DESIGN
General approach: Analyze a problem to identify exploitable concurrency
Main concept is decomposition: Divide a computation into smaller parts, all or some of which can run concurrently
FINDING CONCURRENCY
12
Tasks: Programmer-defined units into which the main computation is decomposed
Unit of execution (UE): Generalization of processes and threads
SOME TERMINOLOGY
13
Decompose a problem into tasks that can run
concurrently
Few large tasks vs. many small tasks
Minimize dependencies among tasks
TASK DECOMPOSITION
14
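A minimal Python sketch of task decomposition (a hypothetical prime-counting example, not from the slides; threads stand in for UEs). The granularity choice from the slide is visible as the chunk size: 10 tasks of 100 numbers each, rather than 1000 tiny tasks, and no task depends on another.

```python
# Decompose a primality count over [0, 1000) into independent tasks.
from concurrent.futures import ThreadPoolExecutor

def count_primes(lo, hi):
    """One task: count the primes in [lo, hi), independent of other tasks."""
    def is_prime(n):
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n ** 0.5) + 1))
    return sum(1 for n in range(lo, hi) if is_prime(n))

tasks = [(i, i + 100) for i in range(0, 1000, 100)]   # the decomposition
with ThreadPoolExecutor() as ex:
    total = sum(ex.map(lambda t: count_primes(*t), tasks))
print(total)  # prints 168 (number of primes below 1000)
```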
Group tasks to simplify managing their dependencies
Tasks within a group run at the same time
Based on decomposition: Group tasks that belong to the same high-level operations
Based on constraints: Group tasks with the same constraints
GROUP TASKS
15
Order task groups to satisfy the constraints among them
The order must be:
Restrictive enough to satisfy the constraints
Not so restrictive that it sacrifices flexibility, and hence efficiency
Identify dependencies – e.g.:
Group A requires data from group B
Important: Also identify the independent groups
Identify potential deadlocks
ORDER TASKS
16
Decompose a problem's data into units that can be operated on relatively independently
Look at the problem's central data structures
Decomposition is often already implied by, or the basis for, the task decomposition
Again: Few large chunks vs. many small chunks
Improve flexibility: Configurable granularity
DATA DECOMPOSITION
17
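The chunk-size trade-off above can be sketched in a few lines of Python (an illustrative helper, not from the slides): making granularity a parameter keeps the decomposition configurable.

```python
# Data decomposition with configurable granularity.
def decompose(data, chunk_size):
    """Split data into independently processable chunks of a chosen size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

data = list(range(12))
coarse = decompose(data, 6)   # few large chunks: less overhead
fine = decompose(data, 2)     # many small chunks: better load balance
assert [x for chunk in coarse for x in chunk] == data   # lossless split
print(len(coarse), len(fine))  # prints 2 6
```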
Share decomposed data among tasks
Identify task-local and shared data
Classify shared data: read/write or read only?
Identify potential race conditions
Note: Sometimes data sharing implies communication
DATA SHARING
18
Typical
parallel
program
structures
PATTERNS FOR
PARALLEL
PROGRAMMING
How can the identified concurrency be used to
build a program?
3 examples for typical parallel algorithm
structures:
Organize by tasks: Divide & conquer
Organize by data decomposition: Geometric/domain
decomposition
Organize by data flow: Pipeline
A: ALGORITHM STRUCTURE
20
Principle: Split a problem recursively into smaller, solvable subproblems and merge their results
Potential concurrency: Subproblems can be solved simultaneously
DIVIDE & CONQUER
21
Precondition: Subproblems can be solved independently
Efficiency constraint: Splitting and merging should be trivial compared to solving the subproblems
Challenge: The standard base case can lead to too many, too small tasks
End recursion earlier?
DIVIDE & CONQUER
22
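A minimal Divide & Conquer sketch in Python (threads stand in for UEs; the summation problem is a hypothetical example). Note the enlarged base case: ending the recursion earlier than the single-element base case avoids the flood of tiny tasks the slide warns about.

```python
import threading

def dc_sum(data, cutoff=8):
    """Split recursively, solve the halves concurrently, merge the results."""
    if len(data) <= cutoff:        # enlarged base case: ending recursion
        return sum(data)           # early avoids too many tiny tasks
    mid = len(data) // 2
    result = {}
    def solve_left():
        result["left"] = dc_sum(data[:mid], cutoff)
    forked = threading.Thread(target=solve_left)
    forked.start()                       # solve the left half in parallel
    right = dc_sum(data[mid:], cutoff)   # ...and the right half here
    forked.join()                        # wait before merging
    return result["left"] + right        # merge is trivial vs. the work

print(dc_sum(list(range(100))))  # prints 4950
```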
Principle: Organize an algorithm around a linear
data structure that was decomposed into
concurrently updatable chunks
Potential concurrency: Chunks can be updated
simultaneously
GEOMETRIC/DOMAIN DECOMPOSITION
23
Example: Simple blur filter where every pixel is set to the average value of its surrounding pixels
The image can be split into squares
Each square is updated by a task
To update a square's border, information from other squares is required
GEOMETRIC/DOMAIN DECOMPOSITION
24
Again: Granularity of decomposition?
Choose square/cubic chunks to minimize surface
and thus nonlocal data
Replicating nonlocal data can reduce
communication → “ghost boundaries”
Optimization: Overlap update and exchange of
nonlocal data
Number of tasks > number of UEs for better load
balance
GEOMETRIC/DOMAIN DECOMPOSITION
25
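The ghost-boundary idea can be sketched in one dimension (an illustrative Python example, not from the slides; a 3-point moving average plays the role of the blur filter, and each chunk update could be handed to a task). Each chunk is copied together with one replicated cell of nonlocal data per side, so no communication is needed during the update, and the result is independent of the chosen granularity.

```python
# 1-D geometric decomposition with replicated "ghost" cells.
def update_chunk(c):
    """Blur a chunk's interior cells using the replicated ghost cells."""
    return [(c[i - 1] + c[i] + c[i + 1]) / 3 for i in range(1, len(c) - 1)]

def blur(grid, chunk_size):
    """Decompose grid into chunks, replicate ghost cells, update each chunk."""
    out = []
    for start in range(0, len(grid), chunk_size):
        lo, hi = max(start - 1, 0), min(start + chunk_size + 1, len(grid))
        ghosted = grid[lo:hi]              # chunk plus one ghost cell per side
        if start == 0:                     # replicate the edge value at the
            ghosted = [grid[0]] + ghosted  # global boundaries
        if start + chunk_size >= len(grid):
            ghosted = ghosted + [grid[-1]]
        out.extend(update_chunk(ghosted))  # each chunk could be a task
    return out

print(blur([0, 3, 6, 9, 12, 15], chunk_size=3))
# prints [1.0, 3.0, 6.0, 9.0, 12.0, 14.0]
```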
Principle, based on the assembly-line analogy: Data flows through a set of stages
Potential concurrency: Operations can be performed simultaneously on different data items
PIPELINE
26
time →
Pipeline stage 1:  C1 C2 C3 C4 C5 C6
Pipeline stage 2:     C1 C2 C3 C4 C5 C6
Pipeline stage 3:        C1 C2 C3 C4 C5 C6
Example: Instruction pipeline in CPUs
Fetch (instruction)
Decode
Execute
...
PIPELINE
27
Precondition: Dependencies among tasks allow an
appropriate ordering
Efficiency constraint: Number of stages << number
of processed items
Pipeline can also be nonlinear
PIPELINE
28
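A minimal pipeline sketch in Python (threads stand in for UEs; the three arithmetic stages are hypothetical). Stages are connected by queues, so item C2 can be in stage 1 while C1 is already in stage 2, exactly as in the timing diagram above.

```python
import queue
import threading

DONE = object()  # sentinel flushed through the pipeline at the end

def stage(func, inbox, outbox):
    """One pipeline stage: transform items from inbox into outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))

q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
stages = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)),
    threading.Thread(target=stage, args=(lambda x: x * 2, q2, q3)),
    threading.Thread(target=stage, args=(lambda x: x - 3, q3, q4)),
]
for t in stages:
    t.start()
for item in [1, 2, 3, 4, 5, 6]:   # C1 .. C6
    q1.put(item)
q1.put(DONE)
results = []
while (out := q4.get()) is not DONE:
    results.append(out)
for t in stages:
    t.join()
print(results)  # prints [1, 3, 5, 7, 9, 11]
```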
Intermediate stage between the problem-oriented algorithm structure patterns and their realization in a programming environment
Structures that “support” the realization of parallel algorithms
4 examples:
Single program, multiple data (SPMD)
Task farming/Master & Worker
Fork & Join
Shared data
B: SUPPORTING STRUCTURES
29
Principle: The same code runs on every UE
processing different data
Most common technique to write parallel
programs!
SINGLE PROGRAM, MULTIPLE DATA
30
Program stages:
1. Initialize and obtain unique ID for each UE
2. Run the same program on every UE: Differences in the
instructions are driven by the ID
3. Distribute data by decomposing or sharing/copying global
data
Risk: Complex branching and data decomposition
can make the code awful to understand and
maintain
SINGLE PROGRAM, MULTIPLE DATA
31
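The three program stages above can be sketched in Python (threads stand in for UEs; GLOBAL_DATA and the summation are hypothetical). Every UE runs the identical function; only the unique ID decides which slice of the data it works on.

```python
# SPMD sketch: same code on every UE, behavior driven by the UE's ID.
from concurrent.futures import ThreadPoolExecutor

GLOBAL_DATA = list(range(20))   # hypothetical shared/global input
N_UES = 4

def spmd_program(ue_id):
    """The same program runs on every UE; the unique ID drives the differences."""
    my_data = GLOBAL_DATA[ue_id::N_UES]   # 3. ID-based data decomposition
    partial = sum(my_data)                # 2. identical computation everywhere
    return partial                        # (a real rank 0 might also do I/O)

with ThreadPoolExecutor(max_workers=N_UES) as ex:        # 1. one ID per UE
    partials = list(ex.map(spmd_program, range(N_UES)))
print(sum(partials))  # prints 190
```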
Principle: A master task (“farmer”) dispatches
tasks to many worker UEs and collects (“farms”)
the results
TASK FARMING/MASTER & WORKER
32
TASK FARMING/MASTER & WORKER
33
Precondition: Tasks are relatively independent
Master:
Initiates computation
Creates a bag of tasks and stores them e.g. in a shared queue
Launches the worker tasks and waits
Collects the results and shuts down the computation
Workers:
While the bag of tasks is not empty, pop a task and solve it
Flexible through indirect scheduling
Optimization: Master can become a worker too
TASK FARMING/MASTER & WORKER
34
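The master/worker steps above can be sketched in Python (threads stand in for worker UEs; "solving" a task is just squaring a number, a hypothetical stand-in). The indirect scheduling is visible: fast workers simply pop more tasks from the shared bag.

```python
import queue
import threading

def worker(bag, results):
    """Worker: while the bag of tasks is not empty, pop a task and solve it."""
    while True:
        try:
            task = bag.get_nowait()
        except queue.Empty:
            return                        # bag empty: worker shuts down
        results.put((task, task * task))  # "solving" here is just squaring

def master(tasks, n_workers=3):
    bag, results = queue.Queue(), queue.Queue()
    for t in tasks:                       # create the bag of tasks
        bag.put(t)
    workers = [threading.Thread(target=worker, args=(bag, results))
               for _ in range(n_workers)]
    for w in workers:                     # launch the workers and wait
        w.start()
    for w in workers:
        w.join()
    collected = {}                        # collect the results
    while not results.empty():
        task, value = results.get()
        collected[task] = value
    return collected

print(master(range(5)))
```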
Principle: Tasks create (“fork”) and terminate
(“join”) other tasks dynamically
Example: An algorithm designed after the Divide &
Conquer pattern
FORK & JOIN
35
Mapping the tasks to UEs can be done directly or
indirectly
Direct: Each subtask is mapped to a new UE
Disadvantage: UE creation and destruction is expensive
Standard programming model in OpenMP
Indirect: Subtasks are stored inside a shared
queue and handled by a static number of UEs
Concept behind OpenMP
FORK & JOIN
36
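The direct mapping can be sketched in Python (the subtasks and their values are hypothetical): each subtask is forked as a brand-new thread and joined before its result is used. The per-subtask UE creation shown here is exactly the cost the slide flags; an indirect version would push subtasks onto a shared queue served by a fixed pool instead.

```python
import threading

results = {}

def subtask(name, n):
    """A dynamically created subtask; here it just records a value."""
    results[name] = n * 10

# Fork: one new UE (thread) per subtask -- the direct mapping.
threads = [threading.Thread(target=subtask, args=(f"t{i}", i))
           for i in range(4)]
for t in threads:
    t.start()
# Join: wait for the forked subtasks to terminate before using results.
for t in threads:
    t.join()
print(results)
```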
Problem: Manage access to shared data
Principle: Define an access protocol that ensures that the results of a computation are correct for any ordering of the operations on the data
SHARED DATA
37
Model shared data as an (abstract) data type with a fixed set of operations
Operations can be seen as transactions (→ ACID properties)
Start with a simple solution and improve
performance step-by-step:
Only one operation can be executed at any point in time
Improve performance by separating operations into
noninterfering sets
Separate operations into read and write operations
Many different lock strategies…
SHARED DATA
38
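A minimal Python sketch of the simple starting solution (the counter ADT is a hypothetical example): the shared data is wrapped in a type with a fixed set of operations, and a single lock enforces "only one operation at any point in time". A later refinement could use a readers/writer lock so the read-only operation runs concurrently.

```python
import threading

class SharedCounter:
    """Shared data wrapped in an ADT with a fixed set of operations."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):         # write operation
        with self._lock:           # one operation at a time: the
            self._value += amount  # simplest correct protocol

    def balance(self):             # read-only operation -- a candidate
        with self._lock:           # for running concurrently later
            return self._value

counter = SharedCounter()
threads = [threading.Thread(target=counter.add, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.balance())  # prints 100, for any ordering of the operations
```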
QUESTIONS?
T. Mattson, B. Sanders, and B. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.
A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. 2nd edition. Addison-Wesley, 2003.
P. S. Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.
Images from Mattson et al. 2004
REFERENCES
40