Multi-core processors and multithreading
1 iCSC2015, Pawel Szostek, CERN
Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape
Lecture 2
Multi-core processors and multithreading
Paweł Szostek
CERN
Inverted CERN School of Computing, 23-24 February 2015
ADVANCED TOPICS IN COMPUTER ARCHITECTURES
Multi-core processors and multithreading: part 1
CPU evolution
- In the past, manufacturers kept increasing the frequency; transistors were invested in larger caches and more powerful cores
- From 2005 on, transistors are spent on new cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150 W
- Why does a higher clock speed increase power consumption?
Interlude: power dissipation
- In the past, there were no power-dissipation issues
- Heat density (W/cm3) in a modern CPU approaches the level found in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
- This can lead to caveats (see AVX frequencies)
[1]: David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"
Interlude: manufacturing technology
[Figure: size comparison of a flu virus (~120 nm) and a transistor in a 14 nm process]
Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.
Simultaneous Multi-Threading (II)
- Solution: we can utilize idle execution units with a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared; a separate fetching unit is needed
- Can both speed up and slow down execution (see next slide)
Simultaneous Multi-Threading (III)
- SMT workloads from the HEP-SPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different scalability and reactions to SMT
- Cache utilization is the most important factor in SMT impact
Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal; the role of the helper thread is to prefetch the data, working in front of the main thread by accessing data ahead of it
- Think of it as an interesting example of exploiting the hardware
source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"
Non-Uniform Memory Access
- Multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Makes accesses fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Lowers the pressure on the memory bus
Cluster-on-die
- Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two NUMA nodes
Intel architectural extensions

| Extension | Generation/year | Value added |
| MMX | Pentium MMX / 1997 | 64b registers with packed data types, integer operations only |
| SSE | Pentium III / 1999 | 128b registers (XMM), 32b float only |
| SSE2 | Pentium 4 / 2001 | SIMD math on any data type |
| SSE3 | Prescott / 2004 | DSP-oriented math instructions |
| AVX | Sandy Bridge / 2011 | 256b registers (YMM), 3-operand instructions |
| AVX2 | Haswell / 2013 | integer instructions in YMM registers, FMA |
| AVX-512 | Skylake / 2016 | 512b registers |
Hardware evolves → programmers and compilers need to adapt
Intel extensions example – AVX2
- AVX2 is the latest extension from Intel (as of early 2015)
- Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0×$2 + $1), useful for evaluating a polynomial (remember Horner's method?)
- Creative application: the Padé approximant
- VDT is a vector math library using Padé approximants, a plug&play libm replacement with speed-ups reaching 10x
R(x) = (a0 + a1·x + a2·x² + ... + an·xⁿ) / (1 + b1·x + b2·x² + ... + bm·xᵐ)
     = (a0 + x·(a1 + x·(a2 + ... + x·an)...)) / (1 + x·(b1 + x·(b2 + ... + x·bm)...))
CPU improvements summary

Common ways to improve CPU performance:

| Technique | Advantages | Disadvantages |
| Frequency scaling | Immediate scaling | Does not work any more (see: dark silicon) |
| Hyper-threading | Medium overhead, up to 30% performance improvement | Can double a workload's memory footprint, possible cache pollution |
| Architectural changes | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years |
| Microarchitectural changes | Transparent for the users | Huge design overhead |
| More cores | Low design overhead, easy to implement, great scalability | Requires heavily parallel software |

Slide inspiration: A. Nowak, "Multicore Architectures"
PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE
Multi-core processors and multithreading: part 2
Concurrency vs. parallelism
Do concurrent (not parallel) programs need synchronization to access shared resources? Why?
Race conditions
What will the value of n be after both threads finish their work?
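The classic increment race can be reproduced, and fixed, in a few lines of Python. This is an illustrative sketch (unsafe_increment and safe_increment are made-up names): even under CPython's GIL, n += 1 compiles to several bytecodes (load, add, store), and two threads can interleave between them.

```python
import threading

n = 0
lock = threading.Lock()

def unsafe_increment(count):
    global n
    for _ in range(count):
        n += 1              # read-modify-write: threads may interleave here

def safe_increment(count):
    global n
    for _ in range(count):
        with lock:          # mutual exclusion makes the update atomic
            n += 1

threads = [threading.Thread(target=safe_increment, args=(100000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)   # with the lock, this is always 200000
```

Swapping in unsafe_increment can lose updates, which is exactly the uncertainty the question above is pointing at.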
Race conditions (II)
Thread-level parallelism in Python
- C++ parallelism skipped on purpose (already covered at CSC)
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations via threads, by relying on the OS
- The example is deferred to the synchronization slides
- But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?
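One way to see what the GIL does and does not forbid: the lock is released while a thread blocks in I/O, so threads still overlap their waits. A small sketch, where the hypothetical fake_io stands in for a blocking I/O call and is emulated with time.sleep:

```python
import threading
import time

def fake_io(seconds):
    # time.sleep releases the GIL, just like a blocking socket or file read
    time.sleep(seconds)

start = time.time()
workers = [threading.Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
elapsed = time.time() - start
# four 0.2 s waits overlap: the total is close to 0.2 s, not 0.8 s
print("elapsed: %.2f s" % elapsed)
```

CPU-bound threads, by contrast, are serialized by the GIL and gain nothing from threading.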
Thread-level parallelism in Python (II)
- We can easily run many processes with the multiprocessing package to leverage parallelism
- Easy, but not very efficient: high memory footprint, no resource sharing, every worker is a separate process

    from multiprocessing import Pool   # multiprocessing.dummy offers the same API with threads

    def f(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        result = pool.map(f, xrange(10))
CSC Refresher: vector operations
Problem: all the arithmetic operations are executed one element at a time
Solution: introduce vector operations and vector registers
What is the maximal speed-up from vectorization? Why is it hard to obtain it in practice?
Auto-vectorization in gcc
- Vectorization candidates: (inner) loops
- Will only work with more recent gcc versions (>4.6)
- By default auto-vectorization in gcc is disabled; it is enabled by -O3 and -Ofast, or explicitly with -ftree-vectorize
- There are tens of optimization flags, but it's good to retain at least a couple: -mtune=ARCH, -march=ARCH; -O2, -O3, -Ofast; -ftree-vectorize
Vectorization reports
- The compiler can tell us which loops were not vectorized and why:
  gcc: -ftree-vectorizer-verbose=[0-9]
  icc: -vec-report=[0-7]
- List of vectorizable loops available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

    Analyzing loop at vect.cc:14
    vect.cc:14: note: not vectorized: control flow in loop.
    vect.cc:14: note: bad loop form.
    vect.cc:6: note: vectorized 0 loops in function.
Intel architectural extensions (II)
- The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- In ICC, -axARCH can be used instead

ICC:
    __declspec(cpu_specific(generic)) int foo() { return 0; }
    __declspec(cpu_specific(core_i7_sse4_2)) int foo() { return 1; }

GCC:
    __attribute__((target("default"))) int foo() { return 0; }
    __attribute__((target("sse4.2"))) int foo() { return 1; }
Vectorization in C++
- Possible to use intrinsics, but very cumbersome and "write-only"
- Many libraries approach vectorization; the choice is not easy
- Example: Agner Fog's Vector Class

Scalar version:
    float a[8], b[8], c[8];
    ...
    for (int i = 0; i < 8; ++i) {
        c[i] = a[i] + b[i] * 1.5f;
    }

With the Vector Class:
    #include "vectorclass.h"
    float a[8], b[8], c[8];
    ...
    Vec8f avec, bvec, cvec;
    avec.load(a);
    bvec.load(b);
    cvec = avec + bvec * 1.5f;
    cvec.store(c);
Vectorization in Python
- Vectorization in Python is possible, but requires extra modules and extra care
- numpy has a complete set of vectorized mathematical operations; it requires using its own array types instead of built-in ones. Array-notation expressions are vectorized
- Any step outside of the numpy world will dramatically slow down execution
- Gains come not only from vectorization, but also from using C types under the hood
- Example: roots of quadratic equations (see next slide)
Vectorization in Python – example

    import numpy as np
    from cmath import sqrt
    from itertools import izip

    # generate 1M coefficients
    a = np.random.randn(1000000)
    b = np.random.randn(1000000)
    c = np.random.randn(1000000)

    def solve_numpy(a, b, c):
        delta = b*b - 4*a*c
        delta_s = np.sqrt(delta + 0.j)
        x1 = (-b + delta_s) / (2*a)
        x2 = (-b - delta_s) / (2*a)
        return (x1, x2)

    def solve_python(a, b, c):
        for ai, bi, ci in izip(a, b, c):
            delta = bi*bi - 4*ai*ci
            delta_s = sqrt(delta)
            x1 = (-bi + delta_s) / (2*ai)
            x2 = (-bi - delta_s) / (2*ai)
            yield (x1, x2)

    timeit list(solve_python(a,b,c))  →  1 loops, best of 3: 15 s
    timeit list(solve_numpy(a,b,c))   →  10 loops, best of 3: 105 ms

Wow! Where does this speed-up come from?
Accessing shared resources in Python
- C++ locking skipped on purpose (covered by Danilo)
- threading.Lock: the lowest-level synchronization primitive; possible states: released and acquired. Provides two operations: Lock.acquire(blocking=True) and Lock.release()
- threading.RLock: reentrant lock, can be acquired multiple times by the same thread
- Queue.Queue: synchronized queue for message/object passing
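The difference between Lock and RLock is reentrancy. In this illustrative sketch (the function names are made up), the same thread acquires the RLock twice; with a plain Lock, the inner acquisition would block forever:

```python
import threading

rlock = threading.RLock()
entered = []

def inner():
    with rlock:               # second acquisition by the same thread: OK for RLock
        entered.append(True)  # a plain threading.Lock would deadlock here

def outer():
    with rlock:               # first acquisition
        inner()

outer()
print("no deadlock:", entered)
```

An RLock keeps an owner and an acquisition count, and is only released once release() has been called as many times as acquire().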
Shared resources – example

Multithreaded application for fetching and processing webpages; communication through synchronized queues.

    import Queue
    from threading import Thread
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # hosts = [something, something]
    url_queue = Queue.Queue()
    html_queue = Queue.Queue()

    class FetchThread(Thread):
        def __init__(self, url_queue, html_queue):
            Thread.__init__(self)
            self.url_queue = url_queue
            self.html_queue = html_queue

        def run(self):
            while True:
                host = self.url_queue.get()
                url = urllib2.urlopen(host)
                chunk = url.read()
                self.html_queue.put(chunk)
                self.url_queue.task_done()
Shared resources – example cont'd

    class MineThread(Thread):
        def __init__(self, html_queue):
            Thread.__init__(self)
            self.html_queue = html_queue

        def run(self):
            while True:
                c = self.html_queue.get()
                soup = BeautifulSoup(c)
                titles = soup.findAll(['title'])
                print(titles)
                self.html_queue.task_done()

    def main():
        for i in range(5):
            t = FetchThread(url_queue, html_queue)
            t.setDaemon(True)
            t.start()
        for host in hosts:
            url_queue.put(host)
        for i in range(5):
            dt = MineThread(html_queue)
            dt.setDaemon(True)
            dt.start()
        url_queue.join()
        html_queue.join()

    main()
EVOLUTION OF THE COMPUTING LANDSCAPE IN THE FUTURE
Multi-core processors and multithreading: part 3
Intel tick-tock model
Intel Xeon Phi
- CERN openlab collaborating with Intel since 2008
- PCIe co-processor with 61 cores × 4-way SMT
- 1 TFLOPS peak performance, 512-bit vectors
- Next generation: even more cores, 3 times more performance, x86-64 compatible, a standalone CPU... maybe in desktops?
- But... are my applications ready for such massive parallelism?
ARM 64 (AArch64)
- It's all about low power
- 64-bit memory addressing provides support for large memory (>4 GB)
- RISC architecture
- Common software ecosystem with x86-64, uses the same management standards
- CISC is also expanding in this direction (energy efficiency and scalability)
[Figure: energy-efficiency scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi"]
Take-home messages
- Moore's law is doing fine; transistors will be invested in more cores, bigger caches and wider vectors (512b)
- NUMA and COD are more "complex stuff" that a programmer has to keep in mind
- Parallelization is possible not only in C++
- Not everything that looks like an improvement gives you better performance (e.g. AVX)
- Multi-threaded applications always require synchronization to protect shared resources
- Auto-vectorization is a speed-up for free