NUMERICAL PARALLEL COMPUTING
Lecture 1, March 23, 2007: Introduction
Peter Arbenz
Institute of Computational Science, ETH Zürich
E-mail: [email protected]
http://people.inf.ethz.ch/arbenz/PARCO/
Organization
Organization: People/Exercises
1. Lecturer: Peter Arbenz, CAB G69.3 (Universitätsstrasse 6), Tel. 632 7432, [email protected]; Lecture: Friday 8-10, CAB
2. Assistant: Marcus Wittberger, CAB G65.1, Tel. 632, [email protected]; Exercises 1: Friday 14-16, IFW D31
3. Assistant: Cyril Flaig, CAB F63.2, Tel. 632, [email protected]; Exercises 2: Friday 10-12, IFW D31
Introduction
What is parallel computing [in this course]
A parallel computer is a collection of processors that can solve big problems quickly by means of well-coordinated collaboration.

Parallel computing is the use of multiple processors to execute different parts of the same program concurrently or simultaneously.
An example of parallel computing
Assume you are to sort a deck of playing cards (by suits, then by rank). If you do it properly you can complete this task faster if you have people that help you (parallel processing). Note that the work done in parallel is not smaller than the work done sequentially; however, the solution time (wall clock time) is reduced. Notice further that the helpers have to somehow communicate their partial results, which causes some overhead. Clearly, there may be too many helpers (e.g. if there are more than 52). One may observe a relation of speedup vs. number of helpers as depicted in Fig. 1 on the next slide.
Figure: Sorting a deck of cards
Why parallel computing
I Runtime: we want to reduce wall clock time [e.g. in time-critical applications like weather forecasting].
I Memory space: some large applications (grand challenges) need a large number of degrees of freedom to provide meaningful results [reasonably short time step, sufficiently fine discretization; cf. again the weather forecast]. A large number of small processors probably has a much bigger (fast) memory than a single large machine (PC cluster vs. HP Superdome).
The challenges of parallel computing
The idea is simple: connect a sufficient amount of hardware and you can solve arbitrarily large problems. (Interconnection network for processors and memory?) BUT, there are a few problems here...

Let's look at the processors. By Moore's law the number of transistors per square inch doubles every 18 – 24 months, cf. Fig. 2.

Remark: If your problem is not too big you may want to wait until there is a machine that is sufficiently fast to do the job.
Figure: Moore’s law
How did this come about?
I Clock rate (+30% / year)
  I increases power consumption
I Number of transistors (+60-80% / year)
  I parallelization at bit level
  I instruction level parallelism (pipelining)
  I parallel functional units
  I dual-core / multi-core processors
Instruction level parallelism
Figure: Pipelining of an instruction with 4 subtasks: fetch (F), decode (D), execute (E), write back (W)
Some superscalar processors
Issuable instructions and clock rate:

processor           max  ALU  FPU  LS  B   clock (MHz)  year
Intel Pentium        2    2    1   2   1       66       1993
Intel Pentium II     3    2    1   3   1      450       1998
Intel Pentium III    3    2    1   2   1     1100       1999
Intel Pentium 4      3    3    2   2   1     1400       2001
AMD Athlon           3    3    3   2   1     1330       2001
Intel Itanium 2      6    6    2   4   3     1500       2004
AMD Opteron          3    3    3   2   1     1800       2003

ALU: integer instructions; FPU: floating-point instructions; LS: load-store instructions; B: branch instructions; clock rate at time of introduction. [Rauber & Rünger: Parallele Programmierung]
One problem of high performance computing (and of parallel computing in particular) is caused by the fact that the access time to memory has not improved accordingly, see Fig. 4. Memory performance doubles only every 6 years [Hennessy & Patterson].
Figure: Memory vs. CPU performance
To alleviate this problem, memory hierarchies with varying access times have been introduced (several levels of caches). But the further away data are from the processor, the longer they take to load and store.
Data access is everything in determining performance!
Sources of performance losses that are specific to parallel computing are

I communication overhead: synchronization, sending messages, etc. (Data is not only in the processor's own slow memory, but may even be on a remote processor's memory.)
I unbalanced loads: the different processors do not have the same amount of work to do.
Outline of the lecture
I Overview of parallel programming, terminology
I SIMD programming on the Pentium (parallel computers are not just in the RZ, most likely there is one in your backpack!)
I Shared memory programming, OpenMP
I Distributed memory programming, Message Passing Interface (MPI)
I Solving dense systems of equations with ScaLAPACK
I Solving sparse systems iteratively with Trilinos
I Preconditioning, reordering (graph partitioning with METIS), parallel file systems
I Fast Fourier Transform (FFT)
I Applications: particle methods, (bone) structure analysis
For details see the PARCO home page.
Exercises
Exercises’ Objectives
1. To study 3 modes of parallelism
  I Instruction level (chip level, board level, ...): SIMD
  I Shared memory programming on ETH compute server: MIMD
  I Distributed memory programming on Linux cluster: MIMD
2. Several computational areas will be studied
  I Linear algebra (BLAS, iterative methods)
  I FFT and related topics (N-body simulation)
3. Models and programming (remember portability!)
  I Examples will be in C/C++ (calling Fortran routines)
  I OpenMP (HP Superdome Stardust/Pegasus)
  I MPI (Opteron/Linux cluster Gonzales)
4. We expect you to solve 6 out of 8 exercises.
References
1. P. S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann, San Francisco, CA, 1997. http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-339-5
2. R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-671-8
3. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing. Oxford University Press, 2004. http://www.oup.co.uk/isbn/0-19-851577-4
Complementary literature is found on the PARCO home page.
Flynn’s Taxonomy of Parallel Systems
In Flynn's taxonomy, parallel systems are classified according to the number of instruction streams and data streams.
M. Flynn: Proc. IEEE 54 (1966), pp. 1901–1909.
SISD: Single Instruction stream - Single Data stream
The classical von Neumann machine.
I processor: ALU, registers
I memory holds data and program
I bus (a collection of wires) = von Neumann bottleneck

Today's PCs or workstations are no longer true von Neumann machines (superscalar processors, pipelining, memory hierarchies).
SIMD: Single Instruction stream - Multiple Data stream
During each instruction cycle the central control unit broadcasts an instruction to the subordinate processors, and each of them either executes the instruction or is idle. At any given time a processor is either "active", executing exactly the same instruction as all other processors in a completely synchronous way, or it is idle.
Example SIMD machine: Vector computers
Vector computers were a kind of SIMD parallel computer. Vector operations on machines like the Cray-1, -2, X-MP, Y-MP, ... worked essentially in 3 steps:

1. copy data (like vectors of 64 floating point numbers) into the vector register(s)
2. apply the same operation to all the elements in the vector register(s)
3. copy the result from the vector register(s) back to main memory

These machines did not have a cache but a very fast memory. (Some people say that they only had a cache. But there were no cache lines anyway.) The above three steps could overlap: "the pipelines could be chained".
A variant of SIMD: a pipeline
Complicated operations often take more than one cycle to complete. If such an operation can be split into several stages that each take one cycle, then a pipeline can (after a startup phase) produce a result in each clock cycle.
Example: elementwise multiplication of 2 integer arrays of length n,

c = a .∗ b ⇐⇒ ci = ai ∗ bi, 0 ≤ i < n.

Let the numbers ai, bi, ci be split into four fragments (bytes):

ai = [ai3, ai2, ai1, ai0], bi = [bi3, bi2, bi1, bi0], ci = [ci3, ci2, ci1, ci0].

Then

ci,j = ai,j ∗ bi,j + carry from ai,j−1 ∗ bi,j−1.

This gives rise to a pipeline with four stages.
Example SIMD machine: Pentium 4
I The Pentium III and Pentium 4 support SIMD programming by means of their Streaming SIMD Extensions (SSE, SSE2).
I The Pentium III has vector registers, called Multimedia or MMX registers. There are 8 (eight) of them! They are 64 bit wide and were intended for computations with integer arrays.
I The Pentium 4 additionally has 8 XMM registers that are 128 bit wide.
I The registers are configurable. They support integer and floating point operations. (XMM also supports double (64 bit) operations.)
I The registers can be considered vector registers. Although the registers are very short, they mean the return of vector computing on the desktop.
I We will investigate how these registers can be used and what we can expect in terms of performance.
Example SIMD machine: Cell processor
I IBM in collaboration with Sony and Toshiba
I Playstation 3
I Multicore processor: one Power Processing Element (PPE)
I 8 SIMD co-processors (SPE’s)
I 4 FPUs (32 bit), 4 GHz clock, 32 GFlops per SPE
I 1/4 TFlop per Cell, 80 Watt
I SPEs programmed with compiler intrinsics
Next page: EIB = element interface bus; LS = local store; MIC = memory interface controller; BIC = bus interface controller
Figure: Block diagram of the Cell processor: a 64-bit Power Architecture core (PPU with L1 and L2 caches), eight Synergistic Processing Elements (SPUs, each with a local store, LS), connected by the EIB (up to 96 B/cycle, 16 B/cycle per link), plus MIC, BIC, dual XDR memory, and RRAC I/O.
[Rauber & Rünger: Parallele Programmierung]
MIMD: Multiple Instruction stream - Multiple Data stream
Each processor can execute its own instruction stream on its owndata independently from the other processors. Each processor is afull-fledged CPU with both control unit and ALU. MIMD systemsare asynchronous.
Memory organization
Most parallel machines are MIMD machines. MIMD machines are classified by their memory organization:

I shared memory machines (multiprocessors)
  I parallel processes, threads
  I communication by means of shared variables
  I data dependencies possible, race conditions
  I multi-core processors
Interconnection network
I network usually dynamic: crossbar switch
Crossbar switch with n processors and m memory modules. On the right, the possible switch states.
I uniform access, scalable; but very many wires ⇒ very expensive, used only for a limited number of processors.
I distributed memory machines (multicomputers)
  I all data are local to some processor
  I programmer responsible for data placement
  I communication by message passing
  I easy / cheap to build −→ (Beowulf) clusters
Interconnection network
I network usually static: Array, ring, meshes, tori, hypercubes
I processing elements usually connected to the network through routers. Routers can pipeline messages.
Examples of MIMD machines
HP Superdome (Stardust / Pegasus)
The HP Superdome systems are large multi-purpose parallel computers. They serve as the application servers at ETH. For information see
http://www.id.ethz.ch/services/list/comp_zentral/
Figure: HP Superdome Stardust (3 cabinets, left) and Pegasus (right)
Superdome specifications
I Stardust: 64 Itanium-2 (1.6 GHz) dual-core processors, 256 GB main memory
I Pegasus: 32 Itanium-2 (1.5 GHz) dual-core processors, 128 GB main memory
I HP-UX (Unix)
I Shared memory programming model
I 4-processor cells are connected through a crossbar
I ccNUMA: Cache-coherent, Non-Uniform Memory Access
I Organisation: batch processing. Jobs are submitted to LSF (Load Sharing Facility). Interactive access possible on Pegasus.
I System manager: [email protected]

We will use Pegasus for experiments with shared memory programming with C and compiler directives (OpenMP).
Gonzales Cluster
(Speedy) Gonzales is a high-performance Linux cluster based on 288 dual-processor nodes with 64-bit AMD Opteron 250 processors and a Quadrics QsNet II interconnect.
Figure: An image of the old Linux cluster Asgard
Cluster specifications
I One master node, two login nodes (Gonzales) and three file servers, in addition to 288 compute nodes.
I 1 node = two 64-bit AMD Opteron 2.4 GHz processors, 8 GB of main memory (shared by the two processors). Global view: distributed memory.
I All nodes connected through a Gb-Ethernet switch (NFS and other services).
I Compute nodes inter-connected via a two-layer Quadrics QsNet II network. Sustained bandwidth 900 MB/s, latency 1 µs between any two nodes in the cluster.
Figure: Gonzales fat tree topology
Each 64-way switch is based on a 3-stage fat-tree topology. The top-level switch adds another, 4th stage to this fat-tree.
http://clusterwiki.ethz.ch/wiki/index.php/Gonzales
I Nodes run SuSE Linux (64-bit) with some modifications for the Quadrics interconnect.
I The login nodes have a more or less complete Linux system, including compilers, debuggers, etc., while the compute nodes have a minimal system with only the commands and libraries necessary to run applications.
I The AMD Opteron runs both 32-bit and 64-bit applications.
I Compilers: C/C++, Fortran 77/90 & HPF.
I Note: all parallel applications must be recompiled (in 64-bit) and linked with the optimized MPI library from Quadrics.
I Jobs are submitted from the login nodes to the compute nodes via the LSF batch system, exclusively. Users are not allowed to log in or execute remote commands on the compute nodes.
I System manager: [email protected]
Cray XT-3 at CSCS in Manno
I The Cray XT3 is based on 1664 single-core 2.6 GHz AMD Opteron processors, connected by the Cray SeaStar high-speed network.
I The computer's peak performance is 8.7 Tflop/s.
I Names: the XT3 product is called "Red Storm". The actual machine in Manno is called Horizon.
I Number 94 on the list of the top 500 fastest machines (7.2 Tflop/s). http://www.top500.org/list/2006/11/100 Behind two Blue Genes (EPFL, IBM Research) and an Intel cluster (BMW Sauber).
The Cray XT3 supercomputer is called "Red Storm" in the US. The CSCS model has been baptized "Horizon".
SeaStar router: The high-speed interconnection network exchanges data with the six neighbouring nodes in a 3D-torus topology.
Bone structure analysis
Computation of stresses in a loaded human bone. FE application with 1.2 · 10^9 degrees of freedom.
IBM Blue Gene BG/L
I Presently the fastest parallel computer.
I Blue Gene/L at Lawrence Livermore National Laboratory has 16 racks (65'536 nodes, 131'072 processors): 280 TFlop/s
I Simple, cheap processors (PPC440), moderate cycle time (700 MHz), high performance / Watt, small main memory (512 MB/node)
I 5 networks
  I 3D torus for point-to-point messages (bandwidth 1.4 Gb/s, latency < 6.4 µs)
  I broadcast network for global communication, in particular reduction operations (bandwidth 2.8 Gb/s, latency 5 µs)
  I barrier network for global synchronization (latency 1.5 µs)
  I control network for checking system components (temperature, fan, ...)
  I Gb-Ethernet connects I/O nodes with external data storage.
IBM Blue Gene BG/L
Figure: Block diagram of a Blue Gene/L node: two PPC 440 cores, each with 32K/32K L1 caches and a Double-Hummer FPU; L2 prefetch buffers; a multi-port shared SRAM buffer; a shared 4 MB embedded DRAM serving as L3 cache or memory, with error correction (ECC); and links to the torus network (6 out, 6 in at 1.4 GB/s each), the broadcast network (3 out, 3 in at 2.8 GB/s each), the barrier network, the control network, and Gbit Ethernet.
[Rauber & Rünger: Parallele Programmierung]
IBM Blue Gene BG/L at Lawrence Livermore NL