
Parallelism
Marco Serafini

COMPSCI 590S, Lecture 3


Announcements
• Reviews
  • First paper posted on website
  • Review due by this Wednesday, 11 PM (hard deadline)
• Data Science Career Mixer (save the date!)
  • November 5, 4-7 pm
  • Campus Center Auditorium
  • Recruiting and industry engagement event


Why multi-core architectures?


Multi-Cores
• We have talked about multi-core architectures
• Why do we actually use multi-cores?
• Why not a single core?


Maximum Clock Rate is Stagnating

Source: https://queue.acm.org/detail.cfm?id=2181798

Two major “laws” are collapsing
• Moore’s law
• Dennard scaling


Moore’s Law
• “The density of transistors in an integrated circuit doubles every two years.”
• Smaller transistors → changes propagate faster

So far so good, but the trend is slowing down and it won’t last for long (Intel’s prediction: until 2021 unless new technologies arise) [1]

[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

[Chart omitted; note the exponential axis]


Dennard Scaling
• “Reducing transistor size does not increase power density → power consumption is proportional to chip area”
• Stopped holding around 2006
  • The assumptions break when the physical system gets close to its limits
• Post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor


Heat Dissipation Problem
• Large datacenters consume energy like large cities
• Cooling is the main cost factor

Examples: Google @ Columbia River valley (2006); Facebook @ Luleå (2015)


Where is Luleå?


Possible Solutions
• Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g. Intel’s TurboBoost
  • Only works under low load
• Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload


More Solutions
• Multicores
  • Replace 1 powerful core with multiple weaker cores on a chip
• SIMD
  • Single Instruction Multiple Data
  • A massive number of cores with reduced flexibility
• FPGAs
  • Reconfigurable hardware programmed for a specific task


Multi-Core processors
• Idea: scale computational power linearly
  • Instead of a single 5 GHz core, use 2 * 2.5 GHz cores
• Scale heat dissipation linearly
  • k cores have ~ k times the heat dissipation of a single core
  • Increasing the frequency of a single core by k times causes a superlinear increase in heat dissipation


Memory Bandwidth Bottleneck
• Cores compete for the same main memory bus
• Caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention


How to Leverage Multicores
• Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
  • E.g. PCs run many background apps in parallel: OS, music, antivirus, web browser, …
• How to parallelize a single app is not trivial
• Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination needed (see the sketch below)
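To make “embarrassingly parallel” concrete, here is a minimal Java sketch (not from the slides; all names are illustrative): each thread squares its own chunk of an array, so no coordination is needed beyond the final join.

    // Sketch: an embarrassingly parallel task; each thread owns a disjoint chunk.
    public class ParallelSquare {
        public static void main(String[] args) throws InterruptedException {
            double[] v = new double[1_000_000];
            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[nThreads];
            int chunk = v.length / nThreads;
            for (int t = 0; t < nThreads; t++) {
                final int from = t * chunk;
                final int to = (t == nThreads - 1) ? v.length : from + chunk;
                workers[t] = new Thread(() -> {
                    for (int i = from; i < to; i++) {
                        v[i] = v[i] * v[i];   // each index is touched by exactly one thread
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();                     // wait for all workers to complete
            }
        }
    }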


SIMD Processors
• Single Instruction Multiple Data (SIMD) processors
• Examples
  • Graphical Processing Units (GPUs)
  • Intel Xeon Phi coprocessors
• Q: Are the following snippets possible to run as SIMD?

    for i in [0, n-1] do
        v[i] = v[i] * pi

    for i in [0, n-1] do
        if v[i] < 0.01 then
            v[i] = 0


Automatic Parallelization?
• Holy grail in the multi-processor era
• Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms


Processes vs. Threads


Processes & Threads
• We have discussed that multicores are the future
• How do we make use of parallelism?
• OS/PL support for parallel programming
  • Processes
  • Threads


Processes vs. Threads
• Process: separate memory space
• Thread: shared memory space (except the stack)

                             Processes      Threads
    Heap                     not shared     shared
    Global variables         not shared     shared
    Local variables (stack)  not shared     not shared
    Code                     shared         shared
    File handles             not shared     shared


Parallel Programming
• Shared memory
  • Threads
  • Access the same memory locations (in the heap and global variables)
• Message passing
  • Processes
  • Explicit communication: message passing

Shared Memory


Shared Memory Example

    void main() {
        x = 12;                  // assume that x is a global variable
        t = new ThreadX();
        t.start();               // starts thread t
        y = 12 / x;
        System.out.println(y);
        t.join();                // wait until t completes
    }

    class ThreadX extends Thread {
        void run() {
            x = 0;
        }
    }

• Question: What is printed as output?

This is “pseudo-Java”. In C++: pthread_create, pthread_join.
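For reference, a fully runnable Java version of the pseudo-code might look as follows (a sketch; the class and field names are mine). Depending on the interleaving, it prints 1 (the main thread divides before ThreadX writes) or throws an ArithmeticException (division by zero).

    // Runnable sketch of the pseudo-Java above; names are illustrative.
    public class SharedMemoryExample {
        static int x;                        // shared "global" variable

        static class ThreadX extends Thread {
            @Override
            public void run() {
                x = 0;                       // writes the shared variable
            }
        }

        public static void main(String[] args) throws InterruptedException {
            x = 12;
            Thread t = new ThreadX();
            t.start();                       // starts thread t
            int y = 12 / x;                  // may see x == 12 or x == 0
            System.out.println(y);           // prints 1, or throws ArithmeticException
            t.join();                        // wait until t completes
        }
    }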


Desired: Atomicity

Thread a: … foo() …
Thread b: … foo() …

    void foo() {
        x = 0;
        x = 1;
        y = 1/x;
    }

DESIRED (time flows downward; a happens-before relationship makes the changes of one call visible to the other):

    Thread a        Thread b
    x = 0
    x = 1
    y = 1
                    x = 0
                    x = 1
                    y = 1

POSSIBLE (interleaved execution):

    Thread a        Thread b
    x = 0
    x = 1
                    x = 0
    y = 1/0

foo should be atomic, in the sense of indivisible (from the ancient Greek).


Race Condition
• Non-deterministic access to shared variables
  • Correctness requires a specific sequence of accesses
  • But we cannot rely on it because of non-determinism!
• Solutions
  • Enforce a specific order using synchronization
    • Enforce a sequence of happens-before relationships
  • Locks, mutexes, semaphores: threads block each other
  • Lock-free algorithms: threads do not wait for each other
    • Hard to implement correctly! The typical programmer uses locks
  • Java has optimized thread-safe data structures, e.g., ConcurrentHashMap (see the sketch below)
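To illustrate the last point, here is a small sketch (not from the slides; names are mine) in which several threads count words in a shared ConcurrentHashMap; merge() applies each update atomically, so no explicit locks are needed.

    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: multiple threads update a shared map without explicit locks.
    public class WordCount {
        public static void main(String[] args) throws InterruptedException {
            ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
            String[] words = {"map", "reduce", "map", "shuffle", "map", "reduce"};

            Thread[] workers = new Thread[4];
            for (int t = 0; t < workers.length; t++) {
                workers[t] = new Thread(() -> {
                    for (String w : words) {
                        counts.merge(w, 1, Integer::sum);   // atomic per-key update
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();

            System.out.println(counts);   // e.g. {shuffle=4, reduce=8, map=12}
        }
    }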


Locks

We use a lock variable l and use it to synchronize:

Thread a: … l.lock(); foo(); l.unlock(); …
Thread b: … l.lock(); foo(); l.unlock(); …

    void foo() {
        x = 0;
        x++;
        y = 1/x;
    }

Impossible now:

    Thread a        Thread b
    x = 0
    x = 1
                    x = 0

Possible (time flows downward):

    Thread a            Thread b
    l.lock()
    foo()
                        l.lock() - waits
    l.unlock()
                        l.lock() - acquires
                        foo()
                        l.unlock()

Equivalent in Java: declare the method as synchronized void foo()
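In actual Java, the lock variable l could be a java.util.concurrent.locks.ReentrantLock. A minimal sketch (not from the slides; here the locking is moved inside foo() and released in a try/finally, which is the idiomatic pattern):

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: protecting foo() with an explicit lock.
    public class LockExample {
        static int x, y;
        static final ReentrantLock l = new ReentrantLock();

        static void foo() {
            l.lock();
            try {
                x = 0;
                x++;
                y = 1 / x;          // x is always 1 here: no other thread can interleave
            } finally {
                l.unlock();         // always release, even if an exception is thrown
            }
        }

        // Equivalent alternative: a synchronized method locks on the class object
        static synchronized void fooSynchronized() {
            x = 0;
            x++;
            y = 1 / x;
        }

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(LockExample::foo);
            Thread b = new Thread(LockExample::foo);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(y);  // always prints 1
        }
    }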


Deadlock

• Question: What can go wrong?

Thread a: … l1.lock(); l2.lock(); foo(); l1.unlock(); l2.unlock(); …
Thread b: … l2.lock(); l1.lock(); foo(); l2.unlock(); l1.unlock(); …


Requirements for a Deadlock
• Mutual exclusion: resources (locks) are held and non-shareable
• Hold and wait: a thread holds one resource and requests another
• No preemption: a lock can be released only by the thread holding it
• Circular wait: a chain of threads waiting for each other

• Question: Simple solution?
  • All threads acquire locks in the same order (see the sketch below)
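A sketch of that fix (not from the slides; names are mine): if every thread acquires l1 before l2, a circular wait can never form.

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: avoiding deadlock by acquiring locks in a fixed global order (l1, then l2).
    public class LockOrdering {
        static final ReentrantLock l1 = new ReentrantLock();
        static final ReentrantLock l2 = new ReentrantLock();

        static void safeFoo() {
            l1.lock();              // every thread takes l1 first ...
            l2.lock();              // ... and l2 second, so no circular wait is possible
            try {
                // critical section that needs both resources
            } finally {
                l2.unlock();
                l1.unlock();        // release in reverse order
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(LockOrdering::safeFoo);
            Thread b = new Thread(LockOrdering::safeFoo);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println("done, no deadlock");
        }
    }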


Notify / Wait

Thread a:                      Thread b:
…                              …
synchronized(o) {              synchronized(o) {
    o.wait();                      foo();
    foo();                         o.notify();
}                              }

Execution (time flows downward):

    Thread a                Thread b
    o.wait()
    … Thread a waits …
                            foo()
                            o.notify()
    o.wait() returns
    foo()

notify() on an object sends a signal that wakes up a thread waiting on that object.

This code guarantees that Thread b executes foo before Thread a.
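In practice, wait() is normally paired with a condition flag checked in a loop, which also covers the case where notify() runs before the waiter reaches wait(), and tolerates spurious wakeups. A sketch (not from the slides; names are mine):

    // Sketch: the usual guarded-wait idiom around the notify/wait pattern.
    public class GuardedWait {
        static final Object o = new Object();
        static boolean bDone = false;           // guarded by the lock on o

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(() -> {
                synchronized (o) {
                    while (!bDone) {            // re-check the condition after every wakeup
                        try {
                            o.wait();           // releases the lock on o while waiting
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                    System.out.println("a: runs after b");   // plays the role of foo() in a
                }
            });
            Thread b = new Thread(() -> {
                synchronized (o) {
                    System.out.println("b: runs first");     // plays the role of foo() in b
                    bDone = true;
                    o.notify();                 // wake up a thread waiting on o
                }
            });
            a.start();
            b.start();
            a.join();
            b.join();
        }
    }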


What About Cache Coherency?
• Cache coherency ensures atomicity for
  • Single instructions
  • Single cache lines
• In reality
  • Different variables may reside on different cache lines
  • A variable may be accessed across multiple instructions
    • Single high-level instructions may compile to multiple low-level ones
    • Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a) (a Java analogue is sketched below)
• That’s why we need locks
• Main lesson learned from the cache coherency discussion: you should partition data
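The same load/add/store decomposition applies to count++ in Java, so unsynchronized concurrent increments lose updates. A small sketch (not from the slides) that typically prints less than 2000000:

    // Sketch: count++ compiles to load/add/store, so two threads lose updates.
    public class LostUpdates {
        static int count = 0;       // shared, unsynchronized

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    count++;        // not atomic: read, increment, write back
                }
            };
            Thread a = new Thread(work);
            Thread b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            // Usually prints less than 2000000 because increments interleave and overwrite
            System.out.println(count);
        }
    }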


Challenges with Multi-Threading
• Correctness
  • Heisenbugs: non-deterministic bugs that appear only under certain conditions
  • Hard to reproduce → hard to debug
• Performance
  • Understanding concurrency bottlenecks is hard!
  • “Waiting time” does not show up in profilers (only CPU time does)
• Load balance
  • Make sure all cores work all the time and do not wait


Critical Path
• Coordination (a barrier) makes load balancing harder
• Critical path: the maximum sequential path (here, thread t1: 10 steps)

[Diagram: t1 starts threads t2 and t3; t1, t2, and t3 each take one step in parallel; t1 waits at a barrier for all threads to complete and then runs 9 extra steps, so t1’s sequential path is 10 steps.]

Message Passing


Message Passing
• Processes communicate by exchanging messages
• Sockets: communication endpoints
  • On a network: UDP sockets, TCP sockets (see the sketch below)
  • Internal to a node: Inter-Process Communication (IPC)
  • Different technologies but similar abstractions
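As a concrete illustration (not from the slides; the port number and message are arbitrary), a minimal Java TCP example where one thread sends a message through a socket and the main thread receives it:

    import java.io.*;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch: message passing over a TCP socket on the local machine.
    public class TcpHello {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9090)) {
                // Sender side, run in a separate thread so one program shows both ends
                new Thread(() -> {
                    try (Socket s = new Socket("localhost", 9090);
                         PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                        out.println("hello over TCP");      // send a message
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }).start();

                // Receiver side: accept one connection and read one line
                try (Socket conn = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()))) {
                    System.out.println("received: " + in.readLine());
                }
            }
        }
    }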


Building a Message
• Serialization
  • Message content is stored at scattered locations in RAM
  • It needs to be packed into a byte array to be sent
• Deserialization
  • Receive the byte array
  • Rebuild the original variable
• Pointers do not make sense across nodes!


Example: Serializing a Binary Tree
• Question: How do we serialize it?
• Possible solution
  • DFS (pre-order traversal)
  • Mark null pointers with -1
• How do we deserialize?

Example tree: root 10, left child 5, right child 12 (both leaves, i.e., all of their children are null).
Pre-order serialization with null markers: 10 5 -1 -1 12 -1 -1
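A minimal Java sketch of this scheme (class and method names are mine; it assumes node values are non-negative so that -1 can unambiguously mark nulls):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Sketch: pre-order DFS serialization of a binary tree, with -1 marking null children.
    public class TreeSerialization {
        static class Node {
            int value;
            Node left, right;
            Node(int value) { this.value = value; }
        }

        // Serialize: visit node, then left subtree, then right subtree
        static void serialize(Node n, List<Integer> out) {
            if (n == null) {
                out.add(-1);                 // null pointer marker
                return;
            }
            out.add(n.value);
            serialize(n.left, out);
            serialize(n.right, out);
        }

        // Deserialize: consume values in the same pre-order
        static Node deserialize(Iterator<Integer> in) {
            int v = in.next();
            if (v == -1) return null;        // marker: no node here
            Node n = new Node(v);
            n.left = deserialize(in);
            n.right = deserialize(in);
            return n;
        }

        public static void main(String[] args) {
            Node root = new Node(10);
            root.left = new Node(5);
            root.right = new Node(12);

            List<Integer> msg = new ArrayList<>();
            serialize(root, msg);
            System.out.println(msg);         // [10, 5, -1, -1, 12, -1, -1]

            Node copy = deserialize(msg.iterator());
            System.out.println(copy.value + " " + copy.left.value + " " + copy.right.value);
        }
    }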


Threads + Message Passing
• Client-server model
  • Client sends requests
  • Server computes replies and sends them back
• Threads are often used to hide latency (see the sketch below)
  • Each client request is handled by a thread
  • The request might wait for resources (e.g. I/O)
  • Other threads execute other requests in the meantime
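One common realization is a thread-per-request server loop; a minimal sketch (not from the slides; the port number is arbitrary) that answers each client on its own thread:

    import java.io.*;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch: a server that hides I/O latency by handling each client in its own thread.
    public class ThreadPerRequestServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9091)) {
                while (true) {
                    Socket client = server.accept();          // wait for the next client
                    new Thread(() -> handle(client)).start(); // other requests proceed in parallel
                }
            }
        }

        static void handle(Socket client) {
            try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                String request = in.readLine();               // may block on I/O
                out.println("reply: " + request);             // compute and send the reply
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }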


Processes in Different Languages
• Java (interpreted)
  • The Java Virtual Machine (interpreter) is a process
  • Creating a new process entails creating a new JVM
  • ProcessBuilder (see the sketch after this list)
• C/C++ (compiled)
  • OS-specific details of how processes can be created
  • Typical call: fork()
    • Creates a child process, which executes the instructions after fork()
    • The child process is a full copy of the parent
  • More on forking later
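For the Java side, a minimal ProcessBuilder sketch (the command, java -version, is just an example):

    import java.io.IOException;

    // Sketch: launching a new OS process from Java with ProcessBuilder.
    // Here the child process is another JVM printing its version.
    public class LaunchProcess {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("java", "-version");
            pb.inheritIO();                      // child writes to this process's console
            Process child = pb.start();          // creates the child process
            int exitCode = child.waitFor();      // wait for the child to terminate
            System.out.println("child exited with code " + exitCode);
        }
    }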