Parallelism
Marco Serafini
COMPSCI 590S, Lecture 3



Announcements
• Reviews
  • First paper posted on the website
  • Review due by this Wednesday, 11 PM (hard deadline)
• Data Science Career Mixer (save the date!)
  • November 5, 4-7 PM
  • Campus Center Auditorium
  • Recruiting and industry engagement event


Why multi-core architectures?


Multi-Cores
• We have talked about multi-core architectures
• Why do we actually use multi-cores?
• Why not a single core?


Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major "laws" are collapsing:
• Moore's law
• Dennard scaling


Moore's Law
• "The density of transistors in an integrated circuit doubles every two years."
  • Smaller transistors → signals propagate faster
• So far so good, but the trend is slowing down and won't last much longer (Intel's prediction: until 2021, unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Chart: transistor counts over time, plotted on an exponential axis]


Dennard Scaling
• "Reducing transistor size does not increase power density → power consumption is proportional to chip area"
• Stopped holding around 2006
  • Its assumptions break when the physical system gets close to its limits
• Today's post-Dennard-scaling world
  • Huge cooling and power consumption issues
  • If clock frequency had kept growing at the same rate, a CPU today would have the power density of a nuclear reactor


Heat Dissipation Problem
• Large datacenters consume energy like large cities
• Cooling is the main cost factor

[Photos: Google's datacenter in the Columbia River valley (2006); Facebook's datacenter in Luleå (2015)]


Where is Luleå?


Possible Solutions
• Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g. Intel's Turbo Boost
  • Only works under low load
• Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload


More Solutions
• Multicores
  • Replace 1 powerful core with multiple weaker cores on a chip
• SIMD
  • Single Instruction, Multiple Data
  • A massive number of cores with reduced flexibility
• FPGAs
  • Reconfigurable hardware that can be specialized for a specific task


Multi-Core Processors
• Idea: scale computational power linearly
  • Instead of a single 5 GHz core, use 2 × 2.5 GHz cores
• Scale heat dissipation linearly (see the note below)
  • k cores have ~k times the heat dissipation of a single core
  • Increasing the frequency of a single core k times increases heat dissipation superlinearly
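A rough power model explains why (a back-of-the-envelope note, not from the slides): dynamic power is approximately P ≈ C · V² · f, where C is the switched capacitance, V the supply voltage, and f the clock frequency. Raising f usually requires raising V as well, so single-core power grows roughly with the cube of frequency, while k cores at the original frequency draw only about k times the original power.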


Memory Bandwidth Bottleneck
• Cores compete for the same main memory bus
• Caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention


How to Leverage Multicores
• Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
• E.g. PCs run many background apps in parallel
  • OS, music, antivirus, web browser, …
• Parallelizing a single app is not trivial
• Embarrassingly parallel tasks (see the sketch below)
  • Can be run by multiple threads
  • No coordination needed
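A minimal sketch of an embarrassingly parallel task in Java (my own illustration, not from the slides): each thread scales a disjoint slice of an array, so the only coordination is the final join.

    public class EmbarrassinglyParallel {
        public static void main(String[] args) throws InterruptedException {
            double[] v = new double[1_000_000];
            java.util.Arrays.fill(v, 1.0);
            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[nThreads];
            int chunk = v.length / nThreads;
            for (int t = 0; t < nThreads; t++) {
                final int from = t * chunk;
                final int to = (t == nThreads - 1) ? v.length : from + chunk;
                workers[t] = new Thread(() -> {
                    for (int i = from; i < to; i++) v[i] *= 2.0; // disjoint slices: no locks needed
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join(); // the only coordination point
        }
    }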


SIMD Processors
• Single Instruction, Multiple Data (SIMD) processors
• Examples
  • Graphics Processing Units (GPUs)
  • Intel Xeon Phi coprocessors
• Q: Possible SIMD snippets? (A plain-Java rendering follows below.)

    for i in [0, n-1] do
        v[i] = v[i] * pi

    for i in [0, n-1] do
        if v[i] < 0.01 then
            v[i] = 0
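For reference, a plain-Java version of the two snippets (my own sketch, not from the slides). Modern JIT compilers can often auto-vectorize the first loop; the second contains a data-dependent branch, which SIMD hardware typically handles with masked or predicated instructions.

    public class SimdCandidates {
        static void scale(double[] v) {
            for (int i = 0; i < v.length; i++) {
                v[i] = v[i] * Math.PI;     // same operation on every element: ideal for SIMD
            }
        }

        static void threshold(double[] v) {
            for (int i = 0; i < v.length; i++) {
                if (v[i] < 0.01) v[i] = 0; // branch per element: needs masking on SIMD hardware
            }
        }
    }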


Automatic Parallelization?
• Holy grail of the multi-processor era
• Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms


Processes vs. Threads


Processes & Threads
• We have argued that multicore is the future
• How do we make use of this parallelism?
• OS/PL support for parallel programming
  • Processes
  • Threads


Processes vs. Threads
• Process: separate memory space
• Thread: shared memory space (except the stack)

                              Processes     Threads
    Heap                      not shared    shared
    Global variables          not shared    shared
    Local variables (stack)   not shared    not shared
    Code                      shared        shared
    File handles              not shared    shared


Parallel Programming
• Shared memory
  • Threads
  • Access the same memory locations (heap & global variables)
• Message passing
  • Processes
  • Explicit communication: message passing


Shared Memory


Shared Memory Example

    void main() {
        x = 12;                 // assume that x is a global variable
        t = new ThreadX();
        t.start();              // starts thread t
        y = 12 / x;
        System.out.println(y);
        t.join();               // wait until t completes
    }

    class ThreadX extends Thread {
        void run() {
            x = 0;
        }
    }

• Question: What is printed as output?

This is "pseudo-Java" (a compilable version follows below); in C++ you would use pthread_create and pthread_join.
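A compilable version of the slide's pseudo-Java (my own sketch; class and field names are illustrative). Depending on how the two threads interleave, the program prints 1 or throws ArithmeticException (division by zero).

    public class SharedMemoryExample {
        static int x;   // the "global" variable: a static field shared by all threads

        public static void main(String[] args) throws InterruptedException {
            x = 12;
            Thread t = new ThreadX();
            t.start();              // t may set x = 0 at any point from here on
            int y = 12 / x;         // races with t's write to x
            System.out.println(y);
            t.join();               // wait until t completes
        }
    }

    class ThreadX extends Thread {
        @Override
        public void run() {
            SharedMemoryExample.x = 0;
        }
    }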


Desired: Atomicity

Thread a and Thread b both call foo():

    void foo() {
        x = 0;
        x = 1;
        y = 1/x;
    }

DESIRED: the calls execute one after the other; the first call's changes become visible to the second (a happens-before relationship; time flows left to right):

    Thread a: x = 0   x = 1   y = 1
    Thread b:                         x = 0   x = 1   y = 1

POSSIBLE: the calls interleave:

    Thread a: x = 0   x = 1           y = 1/0   → division by zero!
    Thread b:                 x = 0

foo should be atomic, in the sense of indivisible (from the ancient Greek).


Race Condition
• Non-deterministic access to shared variables
  • Correctness requires a specific sequence of accesses
  • But we cannot rely on it because of non-determinism!
• Solutions (see the sketch below)
  • Enforce a specific order using synchronization
    • Enforce a sequence of happens-before relationships
  • Locks, mutexes, semaphores: threads block each other
  • Lock-free algorithms: threads do not wait for each other
    • Hard to implement correctly! The typical programmer uses locks
  • Java has optimized thread-safe data structures, e.g. ConcurrentHashMap
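A minimal sketch of a race condition and a lock-free fix in Java (my own illustration; names are arbitrary). counter++ is a read-modify-write, so two threads can interleave and lose updates; AtomicInteger makes the increment indivisible.

    import java.util.concurrent.atomic.AtomicInteger;

    public class RaceDemo {
        static int counter = 0;                          // racy
        static AtomicInteger safe = new AtomicInteger(); // lock-free fix

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) {
                    counter++;                // not atomic: load, add, store
                    safe.incrementAndGet();   // atomic read-modify-write
                }
            };
            Thread a = new Thread(work), b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(counter);      // often < 200000: lost updates
            System.out.println(safe.get());   // always 200000
        }
    }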


Locks

We use a lock variable l to synchronize:

    Thread a          Thread b          void foo() {
    ...               ...                   x = 0;
    l.lock()          l.lock()              x++;
    foo()             foo()                 y = 1/x;
    l.unlock()        l.unlock()        }

Impossible now (no interleaving inside foo):

    Thread a: x = 0   x = 1
    Thread b:                 x = 0

Possible (the lock serializes the two calls; time flows left to right):

    Thread a: l.lock()           foo()   l.unlock()
    Thread b: l.lock() - waits ....................   l.lock() - acquires   foo()   l.unlock()

Equivalent in Java: declare foo as synchronized void foo(). (A runnable version with an explicit lock object follows below.)
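A runnable version of the locking pattern using java.util.concurrent (my own sketch; the slide's l corresponds to the ReentrantLock below). The try/finally is idiomatic: the lock is released even if foo throws.

    import java.util.concurrent.locks.ReentrantLock;

    public class LockDemo {
        static final ReentrantLock l = new ReentrantLock();
        static int x, y;

        static void foo() {
            x = 0;
            x++;
            y = 1 / x;    // safe: no other thread can interleave a write to x here
        }

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                l.lock();                 // blocks until the lock is free
                try { foo(); }
                finally { l.unlock(); }   // always release
            };
            Thread a = new Thread(task), b = new Thread(task);
            a.start(); b.start();
            a.join(); b.join();
        }
    }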


Deadlock
• Question: What can go wrong?

    Thread a          Thread b
    ...               ...
    l1.lock()         l2.lock()
    l2.lock()         l1.lock()
    foo()             foo()
    l1.unlock()       l2.unlock()
    l2.unlock()       l1.unlock()


Requirements for a Deadlock
• Mutual exclusion: resources (locks) are held and non-shareable
• Hold and wait: a thread holds one resource while requesting another
• No preemption: a lock can only be released by the thread holding it
• Circular wait: a chain of threads, each waiting for the next
• Question: Simple solution?
  • All threads acquire locks in the same order (see the sketch below)
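A minimal sketch of the fix (my own illustration): both threads acquire l1 before l2, so a circular wait can never form.

    import java.util.concurrent.locks.ReentrantLock;

    public class NoDeadlock {
        static final ReentrantLock l1 = new ReentrantLock();
        static final ReentrantLock l2 = new ReentrantLock();

        static void safeFoo() {
            l1.lock();            // global order: always l1 first, ...
            l2.lock();            // ... then l2 - breaks "circular wait"
            try {
                // critical section
            } finally {
                l2.unlock();      // release in reverse order (convention)
                l1.unlock();
            }
        }

        public static void main(String[] args) {
            new Thread(NoDeadlock::safeFoo).start();
            new Thread(NoDeadlock::safeFoo).start();
        }
    }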


Notify / Wait

    Thread a                    Thread b
    ...                         ...
    synchronized (o) {          synchronized (o) {
        o.wait();                   foo();
        foo();                      o.notify();
    }                           }

Timeline (time flows left to right):

    Thread a: o.wait() ... waits ...                      foo()
    Thread b:                         foo()   o.notify()

notify() on an object sends a signal that wakes up threads waiting on that object.

This code is intended to guarantee that Thread b executes foo before Thread a. (Caveat: if b's notify() runs before a's wait(), the notification is lost and a blocks forever; robust code therefore waits on a condition flag in a loop, as sketched below.)
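A more robust version of the pattern (my own sketch, not from the slides), using the standard guarded-wait idiom: the waiter re-checks a flag in a loop, so the ordering holds even if b runs first, and spurious wakeups are tolerated.

    public class NotifyWaitDemo {
        static final Object o = new Object();
        static boolean bDone = false;   // guard condition, protected by o's monitor

        public static void main(String[] args) {
            Thread a = new Thread(() -> {
                synchronized (o) {
                    while (!bDone) {    // loop handles lost notifications and spurious wakeups
                        try { o.wait(); } catch (InterruptedException e) { return; }
                    }
                    System.out.println("a runs foo after b");
                }
            });
            Thread b = new Thread(() -> {
                synchronized (o) {
                    System.out.println("b runs foo first");
                    bDone = true;
                    o.notify();         // wake a thread waiting on o, if any
                }
            });
            a.start(); b.start();
        }
    }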


What About Cache Coherency?
• Cache coherency ensures atomicity only for
  • Single instructions
  • Single cache lines
• In reality
  • Different variables may reside on different cache lines
  • A variable may be accessed across multiple instructions
    • A single high-level statement may compile to multiple low-level instructions
    • Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a)
• That's why we need locks
• Main lesson from the cache coherency discussion: you should partition data


Challenges with Multi-Threading
• Correctness
  • Heisenbugs: non-deterministic bugs that appear only under certain conditions
  • Hard to reproduce → hard to debug
• Performance
  • Understanding concurrency bottlenecks is hard!
  • "Waiting time" does not show up in profilers (only CPU time does)
• Load balance
  • Make sure all cores work all the time and do not wait


Critical Path
• Coordination (a barrier) makes load balancing harder
• Critical path: the longest sequential path through the computation (here: thread t1, 10 steps; see the sketch below)

[Diagram: t1 runs alone, then starts multiple threads t1, t2, t3; t1 executes 9 extra steps while t2 and t3 take one step each; a barrier waits for all threads to complete, then t1 continues alone]
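A small sketch of the fork/barrier structure in Java (my own illustration; join() plays the role of the barrier). With 9 steps of work on one thread and 1 on each of the others, everyone's completion time equals that of the longest thread, i.e. the critical path.

    public class CriticalPath {
        public static void main(String[] args) throws InterruptedException {
            int[] steps = {9, 1, 1};                  // unbalanced work: t1 dominates
            Thread[] ts = new Thread[steps.length];
            for (int i = 0; i < ts.length; i++) {
                final int work = steps[i];
                ts[i] = new Thread(() -> {
                    for (int s = 0; s < work; s++) {
                        // one unit of work per step
                    }
                });
                ts[i].start();                        // "start multiple threads"
            }
            for (Thread t : ts) t.join();             // barrier: wait for all to complete
            // total time ~ max(steps), not avg(steps): the critical path
        }
    }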


Message Passing


Message Passing
• Processes communicate by exchanging messages
• Sockets: communication endpoints (see the sketch below)
  • Across a network: UDP sockets, TCP sockets
  • Within a node: Inter-Process Communication (IPC)
  • Different technologies, but similar abstractions
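A minimal TCP example in Java (my own sketch; port 9090 is arbitrary): the server reads one line from a client and echoes it back.

    import java.io.*;
    import java.net.*;

    public class EchoServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9090)) {
                try (Socket client = server.accept();   // blocks until a client connects
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println(in.readLine());          // echo one message back
                }
            }
        }
    }

A client connects with new Socket("localhost", 9090) and reads/writes over the same stream abstractions.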


Building a Message
• Serialization
  • Message content is stored at scattered locations in RAM
  • It must be packed into a byte array to be sent
• Deserialization
  • Receive the byte array
  • Rebuild the original data structure
• Pointers do not make sense across nodes!


Example: Serializing a Binary Tree
• Question: How do we serialize it?
• Possible solution (see the sketch below)
  • DFS (pre-order traversal)
  • Mark null pointers with -1
• How do we deserialize?

Example: a tree with 10 at the root, 5 as its left child, and 12 as its right child (all other pointers null) serializes to:

    10 5 -1 -1 12 -1 -1
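A sketch of the DFS approach in Java (my own code; the Node class is assumed). Pre-order traversal emits each key and writes -1 for null pointers; deserialization consumes the sequence in the same order.

    import java.util.*;

    class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    public class TreeCodec {
        // Pre-order DFS; -1 marks a null pointer (assumes keys are non-negative)
        static void serialize(Node n, List<Integer> out) {
            if (n == null) { out.add(-1); return; }
            out.add(n.key);
            serialize(n.left, out);
            serialize(n.right, out);
        }

        // Rebuild by consuming tokens in the same pre-order
        static Node deserialize(Iterator<Integer> it) {
            int v = it.next();
            if (v == -1) return null;
            Node n = new Node(v);
            n.left = deserialize(it);
            n.right = deserialize(it);
            return n;
        }

        public static void main(String[] args) {
            Node root = new Node(10);
            root.left = new Node(5);
            root.right = new Node(12);
            List<Integer> msg = new ArrayList<>();
            serialize(root, msg);
            System.out.println(msg);             // [10, 5, -1, -1, 12, -1, -1]
            Node copy = deserialize(msg.iterator());
        }
    }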


Threads + Message Passing
• Client-server model
  • The client sends requests
  • The server computes replies and sends them back
• Threads are often used to hide latency (see the sketch below)
  • Each client request is handled by a thread
  • A request might wait for resources (e.g. I/O)
  • Other threads execute other requests in the meantime
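A sketch of the thread-per-request pattern (my own illustration, building on the echo server above): the accept loop hands each connection to a new thread, so a slow client does not block the others.

    import java.io.*;
    import java.net.*;

    public class ThreadedServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9090)) {
                while (true) {
                    Socket client = server.accept();          // accepts one connection at a time...
                    new Thread(() -> handle(client)).start(); // ...but handles them concurrently
                }
            }
        }

        static void handle(Socket client) {
            try (Socket c = client;
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(c.getInputStream()));
                 PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
                out.println(in.readLine());   // may block on I/O without stalling other clients
            } catch (IOException e) {
                // connection dropped: ignore in this sketch
            }
        }
    }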


Processes in Different Languages
• Java (interpreted)
  • The Java Virtual Machine (interpreter) is a process
  • Creating a new process entails creating a new JVM
  • ProcessBuilder (see the sketch below)
• C/C++ (compiled)
  • OS-specific details of how processes are created
  • Typical call: fork()
    • Creates a child process, which continues executing from the instruction after fork()
    • The child process is a full copy of the parent
• More on forking later
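A minimal ProcessBuilder sketch in Java (my own illustration; the command is arbitrary): it launches a child OS process and waits for it to finish.

    import java.io.IOException;

    public class SpawnProcess {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("java", "-version");
            pb.inheritIO();                 // reuse this process's stdin/stdout/stderr
            Process child = pb.start();     // creates a new OS process (here: a new JVM)
            int exit = child.waitFor();     // block until the child terminates
            System.out.println("child exited with " + exit);
        }
    }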