Multi-core processors and multithreading
1 iCSC2015, Pawel Szostek, CERN
Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape
Lecture 2
Multi-core processors and multithreading
Paweł Szostek
CERN
Inverted CERN School of Computing, 23-24 February 2015
ADVANCED TOPICS IN COMPUTER ARCHITECTURES
Multi-core processors and multithreading: part 1
CPU evolution
- In the past, manufacturers kept increasing the frequency; transistors were invested in larger caches and more powerful cores
- From 2005 on, transistors are spent on new cores → 10 years of paradigm change (see Herb Sutter's "The Free Lunch Is Over")
- Thermal Design Power (TDP) has stalled at ~150 W
- Why does a higher clock speed increase power consumption?
Interlude: power dissipation
- In the past, there were no power-dissipation issues
- Heat density (W/cm3) in a modern CPU approaches the level found in a nuclear reactor [1]
- "Tricks" are needed to limit power usage (TurboBoost®, AVX frequencies, more transistors for infrequent use)
- This can lead to caveats (see AVX frequencies)
[1]: David Chisnall, "The Dark Silicon Problem and What it Means for CPU Designers"
Interlude: manufacturing technology
[Figure: size comparison of a flu virus (~120 nm) and a transistor in a 14 nm process]
Simultaneous Multi-Threading
- Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy
- This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.
Simultaneous Multi-Threading (II)
- Solution: we can utilize idle execution units with a different thread
- SMT is a hardware feature that can be turned on/off in the BIOS
- Most of the hardware resources (including caches) are shared; a separate fetching unit is needed
- Can both speed up and slow down execution (see next slide)
Simultaneous Multi-Threading (III)
- SMT workloads from the HEP-SPEC06 benchmark
- Many instances of single-threaded processes run in parallel
- Different scalability and reactions to SMT
- Cache utilization is the most important factor in SMT impact
Simultaneous Multi-Threading (IV)
- Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core
- Example: list or tree traversal; the role of the helper thread is to prefetch the data, working in front of the main thread by accessing data ahead of it
- Think of it as an interesting example of exploiting the hardware
source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"
Non-Uniform Memory Access
- Multi-processor architecture where memory access time depends on the location of the memory with respect to the processor
- Makes accesses fast when the memory is "close" to the processor
- There is a performance hit when accessing "foreign" memory
- Lowers the pressure on the memory bus
Cluster-on-die
- Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM)
- Solution: split the memory on one socket into two NUMA nodes
Intel architectural extensions

| Extension | Generation/year | Value added |
| MMX | Pentium MMX / 1997 | 64b registers with packed data types, integer operations only |
| SSE | Pentium III / 1999 | 128b registers (XMM), 32b float only |
| SSE2 | Pentium 4 / 2001 | SIMD math on any data type |
| SSE3 | Prescott / 2004 | DSP-oriented math instructions |
| AVX | Sandy Bridge / 2011 | 256b registers (YMM), 3-operand instructions |
| AVX2 | Haswell / 2013 | integer instructions in YMM registers, FMA |
| AVX-512 | Skylake / 2016 | 512b registers |
Hardware evolves → programmers and compilers need to adapt
Intel extensions example – AVX2
- AVX2 is the latest extension from Intel (as of early 2015)
- Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0×$2 + $1), useful for evaluating a polynomial (remember Horner's method?)
- Creative application: the Padé approximant
- VDT is a vector math library using Padé approximants, a plug&play libm replacement with speed-ups reaching 10x
R(x) = (a0 + a1·x + a2·x² + ... + an·xⁿ) / (1 + b1·x + b2·x² + ... + bm·xᵐ)
     = (a0 + x·(a1 + x·(a2 + ... + x·an)...)) / (1 + x·(b1 + x·(b2 + ... + x·bm)...))
CPU improvements summary

Common ways to improve CPU performance:

| Technique | Advantages | Disadvantages |
| Frequency scaling | Immediate scaling | Does not work any more (see: dark silicon) |
| Hyper-threading | Medium overhead, up to 30% performance improvement | Can double a workload's memory footprint, possible cache pollution |
| Architectural changes | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years |
| Microarchitectural changes | Transparent for the users | Huge design overhead |
| More cores | Low design overhead, easy to implement, great scalability | Requires heavily parallel software |

Slide inspiration: A. Nowak, "Multicore Architectures"
PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE
Multi-core processors and multithreading: part 2
Concurrency vs. parallelism
Do concurrent (not parallel) programs need synchronization to access shared resources? Why?
Race conditions
What will the value of n be after both threads finish their work?
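The classic increment race can be reproduced, and fixed, in a few lines of Python. This is an illustrative sketch (unsafe_increment and safe_increment are made-up names): even under CPython's GIL, n += 1 compiles to several bytecodes (load, add, store), and two threads can interleave between them.

```python
import threading

n = 0
lock = threading.Lock()

def unsafe_increment(count):
    global n
    for _ in range(count):
        n += 1              # read-modify-write: threads may interleave here

def safe_increment(count):
    global n
    for _ in range(count):
        with lock:          # mutual exclusion makes the update atomic
            n += 1

threads = [threading.Thread(target=safe_increment, args=(100000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)   # with the lock, this is always 200000
```

Swapping in unsafe_increment can lose updates, which is exactly the uncertainty the question above is pointing at.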
Race conditions (II)
Thread-level parallelism in Python
- C++ parallelism skipped on purpose (already covered at CSC)
- Python is not a performance-oriented language, but it can be made less slow
- We can still use the threading module to benefit from parallel I/O operations via threads, by relying on the OS
- The example is deferred to the synchronization slides
- But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?
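One way to see what the GIL does and does not forbid: the lock is released while a thread blocks in I/O, so threads still overlap their waits. A small sketch, where the hypothetical fake_io stands in for a blocking I/O call and is emulated with time.sleep:

```python
import threading
import time

def fake_io(seconds):
    # time.sleep releases the GIL, just like a blocking socket or file read
    time.sleep(seconds)

start = time.time()
workers = [threading.Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
elapsed = time.time() - start
# four 0.2 s waits overlap: the total is close to 0.2 s, not 0.8 s
print("elapsed: %.2f s" % elapsed)
```

CPU-bound threads, by contrast, are serialized by the GIL and gain nothing from threading.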
Thread-level parallelism in Python (II)
- We can easily run many processes with the multiprocessing package to leverage parallelism
- Easy, but not very efficient: high memory footprint, no resource sharing, every worker is a separate process

    from multiprocessing import Pool   # multiprocessing.dummy offers the same API with threads

    def f(x):
        return x * x

    if __name__ == '__main__':
        pool = Pool(processes=4)
        result = pool.map(f, xrange(10))
CSC Refresher: vector operations
Problem: all the arithmetic operations are executed one element at a time
Solution: introduce vector operations and vector registers
What is the maximal speed-up from vectorization? Why is it hard to obtain it in practice?
Auto-vectorization in gcc
- Vectorization candidates: (inner) loops
- Will only work with more recent gcc versions (>4.6)
- By default auto-vectorization in gcc is disabled; it is enabled by -O3 and -Ofast, or explicitly with -ftree-vectorize
- There are tens of optimization flags, but it's good to retain at least a couple: -mtune=ARCH, -march=ARCH; -O2, -O3, -Ofast; -ftree-vectorize
Vectorization reports
- The compiler can tell us which loops were not vectorized and why:
  gcc: -ftree-vectorizer-verbose=[0-9]
  icc: -vec-report=[0-7]
- List of vectorizable loops available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

    Analyzing loop at vect.cc:14
    vect.cc:14: note: not vectorized: control flow in loop.
    vect.cc:14: note: bad loop form.
    vect.cc:6: note: vectorized 0 loops in function.
Intel architectural extensions (II)
- The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch)
- A run-time check is added to the output code
- In ICC, -axARCH can be used instead

ICC:
    __declspec(cpu_specific(generic)) int foo() { return 0; }
    __declspec(cpu_specific(core_i7_sse4_2)) int foo() { return 1; }

GCC:
    __attribute__((target("default"))) int foo() { return 0; }
    __attribute__((target("sse4.2"))) int foo() { return 1; }
Vectorization in C++
- Possible to use intrinsics, but very cumbersome and "write-only"
- Many libraries approach vectorization; the choice is not easy
- Example: Agner Fog's Vector Class

Scalar version:
    float a[8], b[8], c[8];
    ...
    for (int i = 0; i < 8; ++i) {
        c[i] = a[i] + b[i] * 1.5f;
    }

With the Vector Class:
    #include "vectorclass.h"
    float a[8], b[8], c[8];
    ...
    Vec8f avec, bvec, cvec;
    avec.load(a);
    bvec.load(b);
    cvec = avec + bvec * 1.5f;
    cvec.store(c);
Vectorization in Python
- Vectorization in Python is possible, but requires extra modules and extra care
- numpy has a complete set of vectorized mathematical operations; it requires using its own array types instead of built-in ones. Array-notation expressions are vectorized
- Any step outside of the numpy world will dramatically slow down execution
- Gains come not only from vectorization, but also from using C types under the hood
- Example: roots of quadratic equations (see next slide)
Vectorization in Python – example

    import numpy as np
    from cmath import sqrt
    from itertools import izip

    # generate 1M coefficients
    a = np.random.randn(1000000)
    b = np.random.randn(1000000)
    c = np.random.randn(1000000)

    def solve_numpy(a, b, c):
        delta = b*b - 4*a*c
        delta_s = np.sqrt(delta + 0.j)
        x1 = (-b + delta_s) / (2*a)
        x2 = (-b - delta_s) / (2*a)
        return (x1, x2)

    def solve_python(a, b, c):
        for ai, bi, ci in izip(a, b, c):
            delta = bi*bi - 4*ai*ci
            delta_s = sqrt(delta)
            x1 = (-bi + delta_s) / (2*ai)
            x2 = (-bi - delta_s) / (2*ai)
            yield (x1, x2)

    timeit list(solve_python(a,b,c))  →  1 loops, best of 3: 15 s
    timeit list(solve_numpy(a,b,c))   →  10 loops, best of 3: 105 ms

Wow! Where does this speed-up come from?
Accessing shared resources in Python
- C++ locking skipped on purpose (covered by Danilo)
- threading.Lock: the lowest-level synchronization primitive; possible states: released and acquired. Provides two operations: Lock.acquire(blocking=True) and Lock.release()
- threading.RLock: reentrant lock, can be acquired multiple times by the same thread
- Queue.Queue: synchronized queue for message/object passing
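The difference between Lock and RLock is reentrancy. In this illustrative sketch (the function names are made up), the same thread acquires the RLock twice; with a plain Lock, the inner acquisition would block forever:

```python
import threading

rlock = threading.RLock()
entered = []

def inner():
    with rlock:               # second acquisition by the same thread: OK for RLock
        entered.append(True)  # a plain threading.Lock would deadlock here

def outer():
    with rlock:               # first acquisition
        inner()

outer()
print("no deadlock:", entered)
```

An RLock keeps an owner and an acquisition count, and is only released once release() has been called as many times as acquire().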
Shared resources – example

Multithreaded application for fetching and processing webpages; communication through synchronized queues.

    import Queue
    from threading import Thread
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # hosts = [something, something]
    url_queue = Queue.Queue()
    html_queue = Queue.Queue()

    class FetchThread(Thread):
        def __init__(self, url_queue, html_queue):
            Thread.__init__(self)
            self.url_queue = url_queue
            self.html_queue = html_queue

        def run(self):
            while True:
                host = self.url_queue.get()
                url = urllib2.urlopen(host)
                chunk = url.read()
                self.html_queue.put(chunk)
                self.url_queue.task_done()
Shared resources – example cont'd

    class MineThread(Thread):
        def __init__(self, html_queue):
            Thread.__init__(self)
            self.html_queue = html_queue

        def run(self):
            while True:
                c = self.html_queue.get()
                soup = BeautifulSoup(c)
                titles = soup.findAll(['title'])
                print(titles)
                self.html_queue.task_done()

    def main():
        for i in range(5):
            t = FetchThread(url_queue, html_queue)
            t.setDaemon(True)
            t.start()
        for host in hosts:
            url_queue.put(host)
        for i in range(5):
            dt = MineThread(html_queue)
            dt.setDaemon(True)
            dt.start()
        url_queue.join()
        html_queue.join()

    main()
EVOLUTION OF THE COMPUTING LANDSCAPE IN THE FUTURE
Multi-core processors and multithreading: part 3
Intel tick-tock model
Intel Xeon Phi
- CERN openlab collaborating with Intel since 2008
- PCIe co-processor with 61 cores × 4-way SMT
- 1 TFLOPS peak performance, 512-bit vectors
- Next generation: even more cores, 3 times more performance, x86-64 compatible, a standalone CPU... maybe in desktops?
- But... are my applications ready for such massive parallelism?
ARM 64 (AArch64)
- It's all about low power
- 64-bit memory addressing provides support for large memory (>4 GB)
- RISC architecture
- Common software ecosystem with x86-64, uses the same management standards
- CISC is also expanding in this direction (energy efficiency and scalability)
[Figure: energy-efficiency scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi"]
Take-home messages
- Moore's law is doing fine; transistors will be invested in more cores, bigger caches and wider vectors (512b)
- NUMA and COD are more "complex stuff" that a programmer has to keep in mind
- Parallelization is possible not only in C++
- Not everything that looks like an improvement gives you better performance (e.g. AVX)
- Multi-threaded applications always require synchronization to protect shared resources
- Auto-vectorization is a speed-up for free