Concurrent and Distributed Programming
Lecture 1: Introduction
References:
● “Intro to parallel computing”, Blaise Barney
● Lecture 1, CS149 Stanford, Aiken & Olukotun
● “Intro to parallel programming”, Gupta
Mark Silberstein, 236370, Winter11 2
Administration
● In charge: Assaf Schuster
● Lectures: Mark Silberstein
  ● Reception hours upon request
  ● Office: Fishbach 453
● TA in charge: Nadav Amit
  ● Administrative questions
● TA: Uri Verner, Checker: Yossi Kuperman, Assistant: Adi Omri
● Syllabus on the site – slight changes possible
Grading
● 50% Exam, 50% Home assignments (counted only if Exam > 55)
● 4 assignments planned; see the syllabus for the deadlines
  – Threads (Java) – decent, OpenMP (C) – easy, BSP (C) – easy
  – MPI (C) – hard
● We use plagiarism-detection software and penalize copying severely
● Late submissions: 3 free late days per semester, 3 pts/day afterwards
● No postponements (don't bother asking)
Prerequisites
● Formal
  ● 234123 Operating Systems
  ● 234267 Digital Computer Structure
    – Or EE analog
● Informal
  ● Programming skills (makefile, Linux, ssh)
  ● Positive attitude
Question
● I write a “hello world” program in plain C
● I have the latest and greatest computer with the latest and greatest OS and compiler
Does my program run in parallel?
Serial vs. parallel program
● Serial: one instruction at a time
● Parallel: multiple instructions in parallel
Why parallel programming?
● Most software will have to be written as parallel software
● Why ?
Free lunch ...
● Early 90s: wait 18 months – get a new CPU with 2x the speed
● Moore's law: the number of transistors per chip doubles roughly every two years
Source: Wikipedia
is over
● More transistors = more execution units
● BUT performance per unit does not improve
● Bad news: serial programs will not run faster
● Good news: parallelization has high performance potential
  ● May get faster on newer architectures
Parallel programming is hard
● Need to optimize for performance
  ● Understand management of resources
  ● Identify bottlenecks
● No one technology fits all needs
  ● Zoo of programming models, languages, runtimes
● Hardware architecture is a moving target
● Parallel thinking is NOT intuitive
● Parallel debugging is not fun
But there is no detour
Concurrent and Distributed Programming Course
● Learn
  ● Basic principles
  ● Parallel and distributed architectures
  ● Parallel programming techniques
  ● Basics of programming for performance
● This is a practical course – you will fail unless you complete all home assignments
Parallelizing Game of Life (cellular automaton)
● Given a 2D grid
● v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors)
[Figure: cell (i,j) and its four neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]
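The update rule above can be sketched in plain C. The grid size, the wrap-around borders, and the concrete rule F below are hypothetical choices for illustration – the slide leaves F abstract, and the real Game of Life uses all 8 neighbors:

```c
#include <assert.h>

#define N 4  /* grid size; hypothetical, for illustration */

/* Hypothetical rule F: a cell becomes the parity of the sum of its
 * four neighbors. The slide only says v_t(i,j) = F(neighbors at t-1). */
static int F(int up, int down, int left, int right)
{
    return (up + down + left + right) % 2;
}

/* One serial time step: v_t(i,j) = F(v_{t-1} of (i,j)'s neighbors).
 * Note the double buffer: 'next' is computed entirely from 'cur' -
 * exactly the ordering a parallel version must preserve. */
void step(const int cur[N][N], int next[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            next[i][j] = F(cur[(i - 1 + N) % N][j],  /* wrap-around borders */
                           cur[(i + 1) % N][j],
                           cur[i][(j - 1 + N) % N],
                           cur[i][(j + 1) % N]);
}
```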
Problem partitioning
● Domain decomposition (SPMD)
  ● Input domain
  ● Output domain
  ● Both
● Functional decomposition (MPMD)
  ● Independent tasks
  ● Pipelining
We choose: domain decomposition
● The field is split between processors
[Figure: the grid split between CPU 0 and CPU 1, with cell (i,j) on the boundary]
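Splitting the field can be sketched as a block decomposition of the grid rows among p CPUs. This is a minimal sketch under that assumption; the helper name `my_rows` is hypothetical, not an API of any library used in the course:

```c
#include <assert.h>

/* Block (domain) decomposition sketch: split 'n' grid rows among 'p'
 * CPUs as evenly as possible. Ranks below n % p get one extra row. */
void my_rows(int n, int p, int rank, int *first, int *count)
{
    int base = n / p, extra = n % p;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

Each CPU then iterates only over rows [first, first + count), reading boundary rows owned by its neighbors.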
Issue 1. Memory
● Can we access v(i+1,j) from CPU 0 as in the serial program?
[Figure: the grid split between CPU 0 and CPU 1; v(i+1,j) resides on CPU 1]
It depends...
● NO – distributed memory architecture: disjoint memory spaces
● YES – shared memory architecture: a single memory space
[Figure: distributed memory – CPU 0 and CPU 1, each with its own memory, connected by a network; shared memory – CPU 0 and CPU 1 attached to one memory]
Someone has to pay
● Distributed memory: harder to program, easier to build hardware
● Shared memory: easier to program, harder to build hardware
[Figure: the same distributed vs. shared memory diagrams as the previous slide]
Memory architecture
● CPU0: Time to access v(i+1,j) = Time to access v(i-1,j)?
[Figure: the grid split between CPU 0 and CPU 1; v(i-1,j) is local to CPU 0, v(i+1,j) may be remote]
Hardware shared memory flavors 1
● Uniform Memory Access (UMA)
  ● Same cost of accessing any data from all processors
Hardware shared memory flavors 2
● Non-Uniform Memory Access (NUMA)
● Software DSM: emulation of NUMA in software over a distributed memory space
Software Distributed Shared Memory (SDSM)
[Figure: process 1 on CPU 0 and process 2 on CPU 1, each with its own memory, connected by a network; DSM daemons provide the shared-memory illusion]
Memory-optimized programming
● Most modern systems are NUMA or distributed
● Access time difference, local vs. remote data: x100–10000
● Memory accesses are the main source of optimization in parallel and distributed programs
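A single-machine analogue of this point is access locality. The sketch below (our illustration, not from the slides) sums the same array twice: both traversals compute the same result, but only one walks memory contiguously, which is the pattern caches and NUMA nodes reward:

```c
#include <assert.h>

#define R 256
#define C 256

static double a[R][C];

/* Row-major traversal: consecutive accesses hit consecutive
 * addresses, so caches and prefetchers work well ("local" accesses). */
double sum_by_rows(void)
{
    double s = 0;
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array strides C doubles between
 * accesses: same result, far worse locality on large arrays. */
double sum_by_cols(void)
{
    double s = 0;
    for (int j = 0; j < C; j++)
        for (int i = 0; i < R; i++)
            s += a[i][j];
    return s;
}
```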
Issue 2: Control
● Can we assign one v per CPU?
[Figure: cell (i,j) and its four neighbors]
Task management overhead
● Each task has state that must be managed
● More tasks – more state to manage
● Who manages the tasks?
● How many tasks should be run?
  ● Does that depend on F?
    – Reminder: v(i,j) = F(all of v(i,j)'s neighbors)
Question
● Every process reads the data from its neighbors
● Will it produce correct results?
[Figure: cell (i,j) and its four neighbors]
Synchronization
● The order of reads and writes made in different tasks is non-deterministic
● Synchronization is required to enforce the order
  ● Locks, semaphores, barriers, condition variables
Check point
Fundamental hardware-related issues
● Memory accesses
  ● Optimizing locality of accesses
● Control
  ● Overhead
● Synchronization
Parallel programming issues
● We decide to split this 3x3 grid like this:
[Figure: the 3x3 grid split among four CPUs]
Problems?
Issue 1: Load balancing
● Always waiting for the slowest task
● Solutions?
[Figure: the 3x3 grid split among four CPUs]
Issue 2: Granularity
● G = computation / communication
● Fine-grain parallelism
  ● G is small
  ● Good load balancing
  ● Potentially high overhead
● Coarse-grain parallelism
  ● G is large
  ● Potentially bad load balancing
  ● Low overhead
Which granularity works for you?
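The trade-off can be put in numbers with a toy model (our assumption, not from the slides): total work W is cut into ntasks tasks, each paying a fixed communication overhead c, so G is the per-task computation over c and the useful-work fraction is G / (G + 1):

```c
#include <assert.h>

/* Toy granularity model: fraction of time spent on useful computation
 * when work W is split into 'ntasks' tasks, each with fixed overhead c. */
double efficiency(double W, int ntasks, double c)
{
    double per_task = W / ntasks;   /* computation per task     */
    double G = per_task / c;        /* granularity G = comp/comm */
    return G / (G + 1.0);           /* useful-work fraction      */
}
```

With W = 1000 and c = 1, splitting into 10 coarse tasks keeps efficiency near 1, while 1000 fine tasks drops it to 0.5 – the "potentially high overhead" of fine grain.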
It depends..
● For each combination of computing platform and parallel application
● The goal is to minimize overheads and maximize CPU utilization
● High performance requires
  ● Enough parallelism to keep CPUs busy
  ● Low relative overheads (communication, synchronization, task control, memory accesses versus computation)
  ● Good load balancing
Summary
● Parallelism does not come for free
  ● Overhead
  ● Memory-aware access
  ● Synchronization
  ● Granularity
Flynn's HARDWARE Taxonomy: single vs. multiple instruction streams × single vs. multiple data streams – SISD, SIMD, MISD, MIMD
SIMD
● Lock-step execution
● Example: vector operations (SSE)
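A minimal SSE sketch in C, assuming an x86 CPU (the function name add4 is hypothetical): one `_mm_add_ps` instruction performs four float additions in lock-step – the same operation applied to four data lanes at once.

```c
#include <immintrin.h>
#include <assert.h>

/* SIMD: one SSE instruction adds four packed floats at once. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats         */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 additions, lock-step */
}
```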
MIMD
● Example: multicores
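A minimal MIMD sketch using POSIX threads (all names hypothetical): on a multicore, each thread is an independent instruction stream, here running different code on different data.

```c
#include <pthread.h>
#include <assert.h>

/* Two different "programs" (functions) on two different data items:
 * multiple instruction streams, multiple data streams. */
static void *set_plus(void *arg)  { *(int *)arg = +5; return NULL; }
static void *set_minus(void *arg) { *(int *)arg = -5; return NULL; }

void run_mimd(int *x, int *y)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, set_plus, x);   /* stream 1 */
    pthread_create(&t2, NULL, set_minus, y);  /* stream 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```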