Concurrent and Distributed Programming
Lecture 1: Introduction
References:
● “Intro to parallel computing”, Blaise Barney
● Lecture 1, CS149 Stanford, Aiken & Olukotun
● “Intro to parallel programming”, Gupta
Mark Silberstein, 236370, Winter11 2
Administration
● In charge: Assaf Schuster
● Lectures: Mark Silberstein
  ● Reception hours upon request
  ● Office: Fishbach 453
● TA in charge: Nadav Amit
  ● Administrative questions
● TA: Uri Verner, Checker: Yossi Kuperman, Assistant: Adi Omri
● Syllabus on the site – slight changes possible
Grading
● 50% Exam, 50% Home assignments (counted only if Exam > 55)
● 4 assignments planned; see the syllabus for the deadlines
  – Threads (Java) – decent, OpenMP (C) – easy, BSP (C) – easy
  – MPI (C) – hard
● We use plagiarism-detection software and penalize copying severely
● Late submissions: 3 free late days per semester, 3 pts/day afterwards
● No postponements (don't bother asking)
Prerequisites
● Formal
  ● 234123 Operating Systems
  ● 234267 Digital Computer Structure
    – Or EE analog
● Informal
  ● Programming skills (makefile, Linux, ssh)
  ● Positive attitude
Question
● I write a “hello world” program in plain C
● I have the latest and greatest computer with the latest and greatest OS and compiler
Does my program run in parallel?
Serial vs. parallel program
● Serial: one instruction at a time
● Parallel: multiple instructions in parallel
Why parallel programming?
● Most software will have to be written as parallel software
● Why ?
Free lunch ...
● Early 90s: wait 18 months – get a new CPU with 2x the speed
● Moore's law: the number of transistors per chip doubles roughly every two years
Source: Wikipedia
is over
● More transistors = more execution units
● BUT performance per unit does not improve
● Bad news: serial programs will not run faster
● Good news: parallelization has high performance potential
  ● May get faster on newer architectures
Parallel programming is hard
● Need to optimize for performance
  ● Understand management of resources
  ● Identify bottlenecks
● No one technology fits all needs
  ● Zoo of programming models, languages, runtimes
● Hardware architecture is a moving target
● Parallel thinking is NOT intuitive
● Parallel debugging is not fun
But there is no detour
Concurrent and Distributed Programming Course
● Learn
  ● Basic principles
  ● Parallel and distributed architectures
  ● Parallel programming techniques
  ● Basics of programming for performance
● This is a practical course – you will fail unless you complete all home assignments
Parallelizing Game of Life (cellular automaton)
● Given a 2D grid
● v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors)
[Figure: cell (i,j) and its four neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]
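The update rule above can be sketched in plain C. The grid size, the wrap-around borders, and the concrete rule F below are hypothetical choices for illustration – the slide leaves F abstract, and the real Game of Life uses all 8 neighbors:

```c
#include <assert.h>

#define N 4  /* grid size; hypothetical, for illustration */

/* Hypothetical rule F: a cell becomes the parity of the sum of its
 * four neighbors. The slide only says v_t(i,j) = F(neighbors at t-1). */
static int F(int up, int down, int left, int right)
{
    return (up + down + left + right) % 2;
}

/* One serial time step: v_t(i,j) = F(v_{t-1} of (i,j)'s neighbors).
 * Note the double buffer: 'next' is computed entirely from 'cur' -
 * exactly the ordering a parallel version must preserve. */
void step(const int cur[N][N], int next[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            next[i][j] = F(cur[(i - 1 + N) % N][j],  /* wrap-around borders */
                           cur[(i + 1) % N][j],
                           cur[i][(j - 1 + N) % N],
                           cur[i][(j + 1) % N]);
}
```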
Problem partitioning
● Domain decomposition (SPMD)
  ● Input domain
  ● Output domain
  ● Both
● Functional decomposition (MPMD)
  ● Independent tasks
  ● Pipelining
We choose: domain decomposition
● The field is split between processors
[Figure: the grid split between CPU 0 and CPU 1, with cell (i,j) on the boundary]
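Splitting the field can be sketched as a block decomposition of the grid rows among p CPUs. This is a minimal sketch under that assumption; the helper name `my_rows` is hypothetical, not an API of any library used in the course:

```c
#include <assert.h>

/* Block (domain) decomposition sketch: split 'n' grid rows among 'p'
 * CPUs as evenly as possible. Ranks below n % p get one extra row. */
void my_rows(int n, int p, int rank, int *first, int *count)
{
    int base = n / p, extra = n % p;
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

Each CPU then iterates only over rows [first, first + count), reading boundary rows owned by its neighbors.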
Issue 1. Memory
● Can we access v(i+1,j) from CPU 0 as in the serial program?
[Figure: the grid split between CPU 0 and CPU 1; v(i+1,j) resides on CPU 1]
It depends...
● NO – distributed memory architecture: disjoint memory spaces
● YES – shared memory architecture: a single memory space
[Figure: distributed memory – CPU 0 and CPU 1, each with its own memory, connected by a network; shared memory – CPU 0 and CPU 1 attached to one memory]
Someone has to pay
● Distributed memory: harder to program, easier to build hardware
● Shared memory: easier to program, harder to build hardware
[Figure: the same distributed vs. shared memory diagrams as the previous slide]
Memory architecture
● CPU0: Time to access v(i+1,j) = Time to access v(i-1,j)?
[Figure: the grid split between CPU 0 and CPU 1; v(i-1,j) is local to CPU 0, v(i+1,j) may be remote]
Hardware shared memory flavors 1
● Uniform Memory Access (UMA)
  ● Same cost of accessing any data from all processors
Hardware shared memory flavors 2
● Non-Uniform Memory Access (NUMA)
● Software DSM: emulation of NUMA in software over a distributed memory space
Software Distributed Shared Memory (SDSM)
[Figure: process 1 on CPU 0 and process 2 on CPU 1, each with its own memory, connected by a network; DSM daemons provide the shared-memory illusion]
Memory-optimized programming
● Most modern systems are NUMA or distributed
● Access time difference, local vs. remote data: x100–10000
● Memory accesses are the main source of optimization in parallel and distributed programs
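A single-machine analogue of this point is access locality. The sketch below (our illustration, not from the slides) sums the same array twice: both traversals compute the same result, but only one walks memory contiguously, which is the pattern caches and NUMA nodes reward:

```c
#include <assert.h>

#define R 256
#define C 256

static double a[R][C];

/* Row-major traversal: consecutive accesses hit consecutive
 * addresses, so caches and prefetchers work well ("local" accesses). */
double sum_by_rows(void)
{
    double s = 0;
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array strides C doubles between
 * accesses: same result, far worse locality on large arrays. */
double sum_by_cols(void)
{
    double s = 0;
    for (int j = 0; j < C; j++)
        for (int i = 0; i < R; i++)
            s += a[i][j];
    return s;
}
```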
Issue 2: Control
● Can we assign one v per CPU?
[Figure: cell (i,j) and its four neighbors]
Task management overhead
● Each task has state that must be managed
● More tasks – more state to manage
● Who manages the tasks?
● How many tasks should be run?
  ● Does that depend on F?
    – Reminder: v(i,j) = F(all of v(i,j)'s neighbors)
Question
● Every process reads the data from its neighbors
● Will it produce correct results?
[Figure: cell (i,j) and its four neighbors]
Synchronization
● The order of reads and writes made in different tasks is non-deterministic
● Synchronization is required to enforce the order
  ● Locks, semaphores, barriers, condition variables
Check point
Fundamental hardware-related issues
● Memory accesses
  ● Optimizing locality of accesses
● Control
  ● Overhead
● Synchronization
Parallel programming issues
● We decide to split this 3x3 grid like this:
[Figure: the 3x3 grid split among four CPUs]
Problems?
Issue 1: Load balancing
● Always waiting for the slowest task
● Solutions?
[Figure: the 3x3 grid split among four CPUs]
Issue 2: Granularity
● G = computation / communication
● Fine-grain parallelism
  ● G is small
  ● Good load balancing
  ● Potentially high overhead
● Coarse-grain parallelism
  ● G is large
  ● Potentially bad load balancing
  ● Low overhead
Which granularity works for you?
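The trade-off can be put in numbers with a toy model (our assumption, not from the slides): total work W is cut into ntasks tasks, each paying a fixed communication overhead c, so G is the per-task computation over c and the useful-work fraction is G / (G + 1):

```c
#include <assert.h>

/* Toy granularity model: fraction of time spent on useful computation
 * when work W is split into 'ntasks' tasks, each with fixed overhead c. */
double efficiency(double W, int ntasks, double c)
{
    double per_task = W / ntasks;   /* computation per task     */
    double G = per_task / c;        /* granularity G = comp/comm */
    return G / (G + 1.0);           /* useful-work fraction      */
}
```

With W = 1000 and c = 1, splitting into 10 coarse tasks keeps efficiency near 1, while 1000 fine tasks drops it to 0.5 – the "potentially high overhead" of fine grain.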
It depends..
● For each combination of computing platform and parallel application
● The goal is to minimize overheads and maximize CPU utilization
● High performance requires
  ● Enough parallelism to keep CPUs busy
  ● Low relative overheads (communication, synchronization, task control, memory accesses versus computation)
  ● Good load balancing
Summary
● Parallelism does not come for free
  ● Overhead
  ● Memory-aware access
  ● Synchronization
  ● Granularity
Flynn's HARDWARE Taxonomy: single vs. multiple instruction streams × single vs. multiple data streams – SISD, SIMD, MISD, MIMD
SIMD
● Lock-step execution
● Example: vector operations (SSE)
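A minimal SSE sketch in C, assuming an x86 CPU (the function name add4 is hypothetical): one `_mm_add_ps` instruction performs four float additions in lock-step – the same operation applied to four data lanes at once.

```c
#include <immintrin.h>
#include <assert.h>

/* SIMD: one SSE instruction adds four packed floats at once. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats         */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 additions, lock-step */
}
```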
MIMD
● Example: multicores
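A minimal MIMD sketch using POSIX threads (all names hypothetical): on a multicore, each thread is an independent instruction stream, here running different code on different data.

```c
#include <pthread.h>
#include <assert.h>

/* Two different "programs" (functions) on two different data items:
 * multiple instruction streams, multiple data streams. */
static void *set_plus(void *arg)  { *(int *)arg = +5; return NULL; }
static void *set_minus(void *arg) { *(int *)arg = -5; return NULL; }

void run_mimd(int *x, int *y)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, set_plus, x);   /* stream 1 */
    pthread_create(&t2, NULL, set_minus, y);  /* stream 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```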