
Parallelism
Marco Serafini

COMPSCI 590S, Lecture 3


Announcements
• Reviews
  • First paper posted on website
  • Review due by this Wednesday, 11 PM (hard deadline)
• Data Science Career Mixer (save the date!)
  • November 5, 4-7 pm
  • Campus Center Auditorium
  • Recruiting and industry engagement event


Why multi-core architectures?


Multi-Cores
• We have talked about multi-core architectures
• Why do we actually use multi-cores?
• Why not a single core?


Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major “laws” are collapsing
• Moore’s law
• Dennard scaling


Moore’s Law
• “The density of transistors in an integrated circuit doubles every two years.”
• Smaller transistors → changes propagate faster

So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: until 2021 unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Chart omitted; note the exponential axis]


Dennard Scaling
• “Reducing transistor size does not increase power density → power consumption is proportional to chip area”
• Stopped holding around 2006
  • The assumptions break when the physical system gets close to its limits
• Post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor


Heat Dissipation Problem
• Large datacenters consume energy like large cities
• Cooling is the main cost factor

Examples: Google @ Columbia River valley (2006); Facebook @ Luleå (2015)


Where is Luleå?


Possible Solutions
• Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g. Intel’s TurboBoost
  • Only works under low load
• Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload


More Solutions
• Multicores
  • Replace 1 powerful core with multiple weaker cores on a chip
• SIMD
  • Single Instruction Multiple Data
  • A massive number of cores with reduced flexibility
• FPGAs
  • Reconfigurable hardware programmed for a specific task


Multi-Core processors
• Idea: scale computational power linearly
  • Instead of a single 5 GHz core, use 2 * 2.5 GHz cores
• Scale heat dissipation linearly
  • k cores have ~ k times the heat dissipation of a single core
  • Increasing the frequency of a single core by k times causes a superlinear increase in heat dissipation


Memory Bandwidth Bottleneck
• Cores compete for the same main memory bus
• Caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention


How to Leverage Multicores
• Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
  • E.g. PCs run many background apps in parallel: OS, music, antivirus, web browser, …
• How to parallelize a single app is not trivial
• Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination needed (see the sketch below)
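To make “embarrassingly parallel” concrete, here is a minimal Java sketch (not from the slides; all names are illustrative): each thread squares its own chunk of an array, so no coordination is needed beyond the final join.

    // Sketch: an embarrassingly parallel task; each thread owns a disjoint chunk.
    public class ParallelSquare {
        public static void main(String[] args) throws InterruptedException {
            double[] v = new double[1_000_000];
            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[nThreads];
            int chunk = v.length / nThreads;
            for (int t = 0; t < nThreads; t++) {
                final int from = t * chunk;
                final int to = (t == nThreads - 1) ? v.length : from + chunk;
                workers[t] = new Thread(() -> {
                    for (int i = from; i < to; i++) {
                        v[i] = v[i] * v[i];   // each index is touched by exactly one thread
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();                     // wait for all workers to complete
            }
        }
    }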


SIMD Processors
• Single Instruction Multiple Data (SIMD) processors
• Examples
  • Graphical Processing Units (GPUs)
  • Intel Xeon Phi coprocessors
• Q: Are the following snippets possible to run as SIMD?

    for i in [0, n-1] do
        v[i] = v[i] * pi

    for i in [0, n-1] do
        if v[i] < 0.01 then
            v[i] = 0


Automatic Parallelization?
• Holy grail in the multi-processor era
• Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms


Processes vs. Threads


Processes & Threads
• We have discussed that multicores are the future
• How do we make use of parallelism?
• OS/PL support for parallel programming
  • Processes
  • Threads


Processes vs. Threads
• Process: separate memory space
• Thread: shared memory space (except the stack)

                             Processes      Threads
    Heap                     not shared     shared
    Global variables         not shared     shared
    Local variables (stack)  not shared     not shared
    Code                     shared         shared
    File handles             not shared     shared


Parallel Programming
• Shared memory
  • Threads
  • Access the same memory locations (in the heap and global variables)
• Message passing
  • Processes
  • Explicit communication: message passing

Shared Memory


Shared Memory Example

    void main() {
        x = 12;                  // assume that x is a global variable
        t = new ThreadX();
        t.start();               // starts thread t
        y = 12 / x;
        System.out.println(y);
        t.join();                // wait until t completes
    }

    class ThreadX extends Thread {
        void run() {
            x = 0;
        }
    }

• Question: What is printed as output?

This is “pseudo-Java”. In C++: pthread_create, pthread_join.
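For reference, a fully runnable Java version of the pseudo-code might look as follows (a sketch; the class and field names are mine). Depending on the interleaving, it prints 1 (the main thread divides before ThreadX writes) or throws an ArithmeticException (division by zero).

    // Runnable sketch of the pseudo-Java above; names are illustrative.
    public class SharedMemoryExample {
        static int x;                        // shared "global" variable

        static class ThreadX extends Thread {
            @Override
            public void run() {
                x = 0;                       // writes the shared variable
            }
        }

        public static void main(String[] args) throws InterruptedException {
            x = 12;
            Thread t = new ThreadX();
            t.start();                       // starts thread t
            int y = 12 / x;                  // may see x == 12 or x == 0
            System.out.println(y);           // prints 1, or throws ArithmeticException
            t.join();                        // wait until t completes
        }
    }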


Desired: Atomicity

Thread a: … foo() …
Thread b: … foo() …

    void foo() {
        x = 0;
        x = 1;
        y = 1/x;
    }

DESIRED (time flows downward; a happens-before relationship makes the changes of one call visible to the other):

    Thread a        Thread b
    x = 0
    x = 1
    y = 1
                    x = 0
                    x = 1
                    y = 1

POSSIBLE (interleaved execution):

    Thread a        Thread b
    x = 0
    x = 1
                    x = 0
    y = 1/0

foo should be atomic, in the sense of indivisible (from the ancient Greek).


Race Condition
• Non-deterministic access to shared variables
  • Correctness requires a specific sequence of accesses
  • But we cannot rely on it because of non-determinism!
• Solutions
  • Enforce a specific order using synchronization
    • Enforce a sequence of happens-before relationships
  • Locks, mutexes, semaphores: threads block each other
  • Lock-free algorithms: threads do not wait for each other
    • Hard to implement correctly! The typical programmer uses locks
  • Java has optimized thread-safe data structures, e.g., ConcurrentHashMap (see the sketch below)
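To illustrate the last point, here is a small sketch (not from the slides; names are mine) in which several threads count words in a shared ConcurrentHashMap; merge() applies each update atomically, so no explicit locks are needed.

    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: multiple threads update a shared map without explicit locks.
    public class WordCount {
        public static void main(String[] args) throws InterruptedException {
            ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
            String[] words = {"map", "reduce", "map", "shuffle", "map", "reduce"};

            Thread[] workers = new Thread[4];
            for (int t = 0; t < workers.length; t++) {
                workers[t] = new Thread(() -> {
                    for (String w : words) {
                        counts.merge(w, 1, Integer::sum);   // atomic per-key update
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();

            System.out.println(counts);   // e.g. {shuffle=4, reduce=8, map=12}
        }
    }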


Locks

We use a lock variable l and use it to synchronize:

Thread a: … l.lock(); foo(); l.unlock(); …
Thread b: … l.lock(); foo(); l.unlock(); …

    void foo() {
        x = 0;
        x++;
        y = 1/x;
    }

Impossible now:

    Thread a        Thread b
    x = 0
    x = 1
                    x = 0

Possible (time flows downward):

    Thread a            Thread b
    l.lock()
    foo()
                        l.lock() - waits
    l.unlock()
                        l.lock() - acquires
                        foo()
                        l.unlock()

Equivalent in Java: declare the method as synchronized void foo()
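In actual Java, the lock variable l could be a java.util.concurrent.locks.ReentrantLock. A minimal sketch (not from the slides; here the locking is moved inside foo() and released in a try/finally, which is the idiomatic pattern):

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: protecting foo() with an explicit lock.
    public class LockExample {
        static int x, y;
        static final ReentrantLock l = new ReentrantLock();

        static void foo() {
            l.lock();
            try {
                x = 0;
                x++;
                y = 1 / x;          // x is always 1 here: no other thread can interleave
            } finally {
                l.unlock();         // always release, even if an exception is thrown
            }
        }

        // Equivalent alternative: a synchronized method locks on the class object
        static synchronized void fooSynchronized() {
            x = 0;
            x++;
            y = 1 / x;
        }

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(LockExample::foo);
            Thread b = new Thread(LockExample::foo);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(y);  // always prints 1
        }
    }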


Deadlock

• Question: What can go wrong?

Thread a: … l1.lock(); l2.lock(); foo(); l1.unlock(); l2.unlock(); …
Thread b: … l2.lock(); l1.lock(); foo(); l2.unlock(); l1.unlock(); …


Requirements for a Deadlock
• Mutual exclusion: resources (locks) are held and non-shareable
• Hold and wait: a thread holds one resource and requests another
• No preemption: a lock can be released only by the thread holding it
• Circular wait: a chain of threads waiting for each other

• Question: Simple solution?
  • All threads acquire locks in the same order (see the sketch below)
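A sketch of that fix (not from the slides; names are mine): if every thread acquires l1 before l2, a circular wait can never form.

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: avoiding deadlock by acquiring locks in a fixed global order (l1, then l2).
    public class LockOrdering {
        static final ReentrantLock l1 = new ReentrantLock();
        static final ReentrantLock l2 = new ReentrantLock();

        static void safeFoo() {
            l1.lock();              // every thread takes l1 first ...
            l2.lock();              // ... and l2 second, so no circular wait is possible
            try {
                // critical section that needs both resources
            } finally {
                l2.unlock();
                l1.unlock();        // release in reverse order
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(LockOrdering::safeFoo);
            Thread b = new Thread(LockOrdering::safeFoo);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println("done, no deadlock");
        }
    }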


Notify / Wait

Thread a:                      Thread b:
…                              …
synchronized(o) {              synchronized(o) {
    o.wait();                      foo();
    foo();                         o.notify();
}                              }

Execution (time flows downward):

    Thread a                Thread b
    o.wait()
    … Thread a waits …
                            foo()
                            o.notify()
    o.wait() returns
    foo()

notify() on an object sends a signal that wakes up a thread waiting on that object.

This code guarantees that Thread b executes foo before Thread a.
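In practice, wait() is normally paired with a condition flag checked in a loop, which also covers the case where notify() runs before the waiter reaches wait(), and tolerates spurious wakeups. A sketch (not from the slides; names are mine):

    // Sketch: the usual guarded-wait idiom around the notify/wait pattern.
    public class GuardedWait {
        static final Object o = new Object();
        static boolean bDone = false;           // guarded by the lock on o

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(() -> {
                synchronized (o) {
                    while (!bDone) {            // re-check the condition after every wakeup
                        try {
                            o.wait();           // releases the lock on o while waiting
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                    System.out.println("a: runs after b");   // plays the role of foo() in a
                }
            });
            Thread b = new Thread(() -> {
                synchronized (o) {
                    System.out.println("b: runs first");     // plays the role of foo() in b
                    bDone = true;
                    o.notify();                 // wake up a thread waiting on o
                }
            });
            a.start();
            b.start();
            a.join();
            b.join();
        }
    }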


What About Cache Coherency?
• Cache coherency ensures atomicity for
  • Single instructions
  • Single cache lines
• In reality
  • Different variables may reside on different cache lines
  • A variable may be accessed across multiple instructions
    • Single high-level instructions may compile to multiple low-level ones
    • Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a) (a Java analogue is sketched below)
• That’s why we need locks
• Main lesson learned from the cache coherency discussion: you should partition data
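The same load/add/store decomposition applies to count++ in Java, so unsynchronized concurrent increments lose updates. A small sketch (not from the slides) that typically prints less than 2000000:

    // Sketch: count++ compiles to load/add/store, so two threads lose updates.
    public class LostUpdates {
        static int count = 0;       // shared, unsynchronized

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    count++;        // not atomic: read, increment, write back
                }
            };
            Thread a = new Thread(work);
            Thread b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            // Usually prints less than 2000000 because increments interleave and overwrite
            System.out.println(count);
        }
    }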


Challenges with Multi-Threading
• Correctness
  • Heisenbugs: non-deterministic bugs that appear only under certain conditions
  • Hard to reproduce → hard to debug
• Performance
  • Understanding concurrency bottlenecks is hard!
  • “Waiting time” does not show up in profilers (only CPU time does)
• Load balance
  • Make sure all cores work all the time and do not wait


Critical Path
• Coordination (a barrier) makes load balancing harder
• Critical path: the maximum sequential path (here, thread t1: 10 steps)

[Diagram: t1 starts threads t2 and t3; t1, t2, and t3 each take one step in parallel; t1 waits at a barrier for all threads to complete and then runs 9 extra steps, so t1’s sequential path is 10 steps.]

Message Passing


Message Passing
• Processes communicate by exchanging messages
• Sockets: communication endpoints
  • On a network: UDP sockets, TCP sockets (see the sketch below)
  • Internal to a node: Inter-Process Communication (IPC)
  • Different technologies but similar abstractions
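As a concrete illustration (not from the slides; the port number and message are arbitrary), a minimal Java TCP example where one thread sends a message through a socket and the main thread receives it:

    import java.io.*;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch: message passing over a TCP socket on the local machine.
    public class TcpHello {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9090)) {
                // Sender side, run in a separate thread so one program shows both ends
                new Thread(() -> {
                    try (Socket s = new Socket("localhost", 9090);
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        out.println("hello over TCP");      // send a message
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }).start();

                // Receiver side: accept one connection and read one line
                try (Socket conn = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()))) {
                    System.out.println("received: " + in.readLine());
                }
            }
        }
    }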


Building a Message
• Serialization
  • Message content is stored at scattered locations in RAM
  • It needs to be packed into a byte array to be sent
• Deserialization
  • Receive the byte array
  • Rebuild the original variable
• Pointers do not make sense across nodes!


Example: Serializing a Binary Tree
• Question: How do we serialize it?
• Possible solution
  • DFS (pre-order traversal)
  • Mark null pointers with -1
• How do we deserialize?

Example tree: root 10, left child 5, right child 12 (both leaves, i.e., all of their children are null).
Pre-order serialization with null markers: 10 5 -1 -1 12 -1 -1
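A minimal Java sketch of this scheme (class and method names are mine; it assumes node values are non-negative so that -1 can unambiguously mark nulls):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Sketch: pre-order DFS serialization of a binary tree, with -1 marking null children.
    public class TreeSerialization {
        static class Node {
            int value;
            Node left, right;
            Node(int value) { this.value = value; }
        }

        // Serialize: visit node, then left subtree, then right subtree
        static void serialize(Node n, List<Integer> out) {
            if (n == null) {
                out.add(-1);                 // null pointer marker
                return;
            }
            out.add(n.value);
            serialize(n.left, out);
            serialize(n.right, out);
        }

        // Deserialize: consume values in the same pre-order
        static Node deserialize(Iterator<Integer> in) {
            int v = in.next();
            if (v == -1) return null;        // marker: no node here
            Node n = new Node(v);
            n.left = deserialize(in);
            n.right = deserialize(in);
            return n;
        }

        public static void main(String[] args) {
            Node root = new Node(10);
            root.left = new Node(5);
            root.right = new Node(12);

            List<Integer> msg = new ArrayList<>();
            serialize(root, msg);
            System.out.println(msg);         // [10, 5, -1, -1, 12, -1, -1]

            Node copy = deserialize(msg.iterator());
            System.out.println(copy.value + " " + copy.left.value + " " + copy.right.value);
        }
    }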


Threads + Message Passing
• Client-server model
  • Client sends requests
  • Server computes replies and sends them back
• Threads are often used to hide latency (see the sketch below)
  • Each client request is handled by a thread
  • The request might wait for resources (e.g. I/O)
  • Other threads execute other requests in the meantime
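One common realization is a thread-per-request server loop; a minimal sketch (not from the slides; the port number is arbitrary) that answers each client on its own thread:

    import java.io.*;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch: a server that hides I/O latency by handling each client in its own thread.
    public class ThreadPerRequestServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9091)) {
                while (true) {
                    Socket client = server.accept();          // wait for the next client
                    new Thread(() -> handle(client)).start(); // other requests proceed in parallel
                }
            }
        }

        static void handle(Socket client) {
            try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                String request = in.readLine();               // may block on I/O
                out.println("reply: " + request);             // compute and send the reply
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }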


Processes in Different Languages
• Java (interpreted)
  • The Java Virtual Machine (interpreter) is a process
  • Creating a new process entails creating a new JVM
  • ProcessBuilder (see the sketch after this list)
• C/C++ (compiled)
  • OS-specific details of how processes can be created
  • Typical call: fork()
    • Creates a child process, which executes the instructions after fork()
    • The child process is a full copy of the parent
  • More on forking later
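For the Java side, a minimal ProcessBuilder sketch (the command, java -version, is just an example):

    import java.io.IOException;

    // Sketch: launching a new OS process from Java with ProcessBuilder.
    // Here the child process is another JVM printing its version.
    public class LaunchProcess {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("java", "-version");
            pb.inheritIO();                      // child writes to this process's console
            Process child = pb.start();          // creates the child process
            int exitCode = child.waitFor();      // wait for the child to terminate
            System.out.println("child exited with code " + exitCode);
        }
    }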