
NUMERICAL PARALLEL COMPUTING

Lecture 1, February 24, 2012: Introduction

http://people.inf.ethz.ch/iyves/pnc12/

Peter Arbenz*, Andreas Adelmann**

* Chair of Computational Science, ETH Zürich, E-mail: [email protected]
** Paul Scherrer Institut, Villigen, E-mail: [email protected]


Organization

Organization: People / Exercises

1. Lecturers:

   - Prof. Dr. Peter Arbenz, ETHZ, Universitätsstrasse 6, CAB H89, Tel. 044 632 7432
   - Dr. Andreas Adelmann, PSI, Villigen, WBGB/132, Tel. 056 310 4233

2. Assistants:

   - Yves Ineichen, CAB H 83.2, Tel. 044 632 …, [email protected]
   - Christof Kraus, ETHZ, CAB H 85.2, Tel. 044 633 …, [email protected]

The exercises take place in HG E 19.


Organization: Lectures

- Basics (P. Arbenz)
  - Lecture 1: Introduction, overview of parallel programming, Moore's law
  - Lecture 2: Terminology, SIMD programming
  - Lectures 3, 4: MIMD, shared memory programming with OpenMP
  - Lectures 5, 6: MIMD, distributed memory programming with the Message Passing Interface (MPI)


Organization: Lectures (cont.)

- Applications (A. Adelmann)
  - Lecture 7: Expression templates, a vector class using ET
  - Lecture 8: Fast solvers on rectangular geometries. FFT, fast Poisson solvers, and their parallelization.
  - Lecture 9: Fast iterative system solvers. Preconditioning, domain decomposition.
  - Lectures 10, 11: N-body problems. Machinery for solving very large problems. Independent Parallel Particle Layer (IPPL). Simulating problems with 1/r potential.
  - Lecture 12: Solving PDEs in parallel. Taxonomy of problems. Parallel FDTD solver for simple rectangular waveguides.


Exercises’ Objectives

1. To study 3 modes of parallelism
   - Instruction level (chip level, board level, ...): SIMD
   - Shared memory programming: MIMD
   - Distributed memory programming: MIMD

2. Several computational areas will be studied
   - Linear algebra
   - System solving
   - FFT and related topics (N-body simulation)

3. Models and programming (remember portability!)
   - Examples will be in C/C++ (calling Fortran routines)
   - OpenMP
   - MPI

4. Exercises on the Opteron cluster of ETH, named Brutus:
   http://www.clusterwiki.ethz.ch/brutus/

5. We expect you to solve 50% of the exercises.


Further reading

1. P. S. Pacheco: An Introduction to Parallel Programming. Morgan Kaufmann, 2011.

2. B. Chapman, G. Jost, R. van der Pas: Using OpenMP. MIT Press, 2008.
   Easy to read. Examples in both C and Fortran (but not C++).

3. P. S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann, San Francisco, CA, 1996.
   Easy to read. Much of the material uses Fortran.

4. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing. Oxford Univ. Press, 2004.
   Original basis of the lecture. Partly outdated.

For further references see the NumPar home page.


Introduction


What is parallel computing [in this course]

A parallel computer is a collection of processing elements that can solve big problems quickly by means of well-coordinated collaboration.

Parallel computing is the use of multiple processors to execute different parts of the same program concurrently or simultaneously.

We do not deal (much) with client-server (master-slave) programming models as used in grid/cloud computing.


Why parallel computing

- Runtime: we want to reduce wall-clock time [e.g., in time-critical applications like weather forecasting].

- Memory space: some large applications (grand challenges) need a large number of degrees of freedom to provide meaningful results [reasonably short time step, sufficiently fine discretization; cf. again weather forecasting]. A large number of small processors typically has a much bigger (fast) memory in aggregate than a single large machine (PC cluster vs. HP Superdome).


Large applications/grand challenges

- Improvement of technological design (airplanes, cars, ships)
- Improving security (car crash simulations)
- Advanced methodologies (molding, burning)
- Understanding nature (big bang, astrophysics, biology)
- Design/interpretation of experiments (physics, biology)
- More efficient production and higher quality (tissue cutting)
- Forecasts (weather, markets, climate, earthquakes)
- Comparisons (database search, web, genomics, proteomics)
- Recognition, detection (image processing)
- Data mining (business, statistics, customer relationship management)
- Intelligence (surveillance, code cracking)
- Virtual experiments (biology, nuclear weapon tests)


TOP500 list of Nov 2011 (excerpt)

Rank  Site           Computer, Year         Cores   Rmax   Rpeak  Power
   1  RIKEN          K computer, '11       705024  10510  11280  12660
   2  NSC Tianjin    Tianhe-1A, '10        186368   2566   4701   4040
   3  Oak Ridge NL   Cray XT5-HE, '09      224162   1759   2331   6950
   4  NSC Shenzhen   Dawning, '10          120640   1271   2984   2580
   5  Tokyo Tech     HP/Nvidia, '10         73278   1192   2287   1399
   6  LANL/SNL       Cray XE6, '11         142272   1110   1366   3980
   7  NASA/Ames      SGI Altix, '11        111104   1088   1315   4102
   8  LBNL           Cray XE6, '10         153408   1054   1289   2910
   9  CEA FR         Bull, '10             138368   1050   1254   4590
  10  Los Alamos NL  IBM Roadrunner, '09   122400   1042   1376   2345
  34  CSCS           Cray XE6, '11          47872    297    402      —
 477  ETH Zurich     Sun Blade X6440, '09    6464     52     66    152

Data from http://top500.org/list/2011/11/100. [Rmax and Rpeak are in TFlop/s; power data are in kW.]


TOP500 performance development on Linpack benchmark since 1993.

[Figure from top500.org]


Processor count (figure from top500.org)


Overview on multicore processors

Development of microprocessors

Moore’s law

Moore's law states that

    The number of transistors on a processor chip doubles every 18–24 months.

Note: this is an observation made in 1965 (not a law) that still holds.
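Stated as a formula (our paraphrase, not from the slides; N_0 is the transistor count at a reference time t_0 and T the doubling period):

```latex
% Exponential growth implied by Moore's observation.
N(t) = N_0 \cdot 2^{(t - t_0)/T}, \qquad T \approx 18\text{--}24 \text{ months}
```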

- A typical processor chip of 2005 consists of 200-400 million transistors.

- The Intel Core 2 Duo processor has 291 million transistors.

- The IBM Cell processor has 250 million transistors.


Moore’s law (cont’d)

- In past years, the increase in the number of transistors was complemented by an increase of the processor frequency.

- The combination led to an annual increase of processor performance of
  - 55% for integer operations,
  - 75% for floating point operations.

- Consequently, if your problem was not too big, you could simply wait until there was a machine sufficiently fast to do the job.

- Or (maybe more realistic): however badly you wrote your code, the increased computer performance would hide it.

- (Un)fortunately this time has gone: increasing the clock rate increases the power consumption, and most of the power is transformed into heat. This is called the power wall.
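To see what these rates meant in practice, a back-of-the-envelope calculation we add here (not from the slides): compound growth at rate r per year gives a factor (1 + r)^n after n years,

```latex
% Floating point performance at +75% per year:
(1 + 0.75)^{4} \approx 9.4, \qquad (1 + 0.75)^{10} \approx 269
```

so waiting four years bought nearly an order of magnitude, which is why the two remarks above were once sound advice.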


Problems

- Increase of performance was based essentially on

  1. increase of the clock rate, i.e., decrease of the cycle time;
  2. introduction of parallelism in the processor through multiple functional units (ILP).

- Further multiplication of functional units is possible, BUT

  1. the speed of memory access did not increase according to the speed of the processors (see plot on next slide):
     - Intel i486 (1990): 6–8 cycles to access data from memory
     - Intel Pentium (2006): > 220 cycles to access data from memory
  2. the increase of the number of transistors on the same area increases packaging and power consumption; sophisticated cooling becomes necessary.


Problems (cont.)

Biggest problem of high performance computing (and of parallel computing in particular): memory access! In contrast to CPU performance, memory performance doubles in 6 years only! [Hennessy & Patterson, Computer Architecture, Morgan Kaufmann, 2006]


Data access is everything in determining performance!

- To alleviate this problem, memory hierarchies with varying access times have been introduced (several levels of caches). But the further away data are from the processor, the longer they take to get and store.

- Computer memory is complicated: the further away data are, the longer they take to get, store, and move. Some estimates:
  - Cache to register on same CPU: ~ 1 nsec
  - Memory to register on same CPU (on-board): ~ 1 µsec
  - Memory to register on a different node: ~ 100 µsec

- On modern machinery, operation counts don't matter very much except as a determinant of data access. Operation counts are thus only an indirect measure of efficiency. (See the sketch below.)
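A minimal C sketch of this point (our illustration, not from the slides): both loops below execute exactly the same number of additions, but the row-wise traversal walks memory with stride 1 and reuses each loaded cache line, while the column-wise traversal jumps N doubles ahead on every access and misses the cache almost every time.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

/* Sum all entries of an N x N matrix stored in row-major order. */
static double sum_rowwise(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];          /* stride-1 access: cache friendly */
    return s;
}

static double sum_colwise(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];          /* stride-N access: cache hostile */
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    for (size_t k = 0; k < (size_t)N * N; k++) a[k] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_rowwise(a);
    clock_t t1 = clock();
    double s2 = sum_colwise(a);
    clock_t t2 = clock();

    printf("row-wise: %g (%.2f s), col-wise: %g (%.2f s)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```

Same operation count, very different run times; the gap is entirely due to data access.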


How are transistors used?

- Increased word length (64 bit).

- Use of pipelining to overlap the execution of consecutive machine instructions.

- Multiple functional units to enable the parallel execution of independent machine instructions.

- Increase of cache sizes to surmount the memory wall, i.e., the slow memory access.


Example: AMD Opteron processor

Slide courtesy of Bruce Hendrickson (Sandia NL).

Example: AMD Opteron processor (cont.) [four figure-only slides, not reproduced]

[Three figure-only slides from J. Dongarra's PPAM'09 talk.]


The challenges of parallel computing

The idea is simple: connect a sufficient amount of hardware and you can solve arbitrarily large problems. (Interconnection network for processors and memory?)

BUT, there are a few problems here...

Let's look at the processors. According to Moore's law (which is not a law!) the number of transistors per square inch doubles every 18–24 months, cf. the figure on the next slide.

Remark: If your problem is not too big you may want to wait until there is a machine that is sufficiently fast to do the job.

Remark: The previous remark is outdated!


Discussion of Moore’s law

- Moore's law was based on
  - increased clock rates (+30% / year),
  - increased numbers of transistors (+60-80% / year).

  The latter has been invested, e.g., in multiple (parallel) functional units, see next slide. Much of the additional real estate goes into memory-related functionality.

- The power consumption prohibits a further (massive) increase of the clock rate.

- Moore's law can only continue to hold if the number of processors is increased.

- Multi-core processors (typically dual/quad-core):
  - A core is a full-fledged processor that shares the main memory with other cores.
  - Note the correspondence (core, processor) ←→ (processor, node).
  - To save energy, clock rates may decrease.


Architecture of multicore processors

- Multicore processors can be realized in various ways. The implementations differ in
  - the number of cores,
  - the size and arrangement of caches,
  - the way cores access caches.

A core is the simplest processing unit present on a die. It contains all the necessary logic to perform memory accesses through caches, computations on both integer and floating point values, and branches (conditional and subroutine). Usually, in a multicore processor, the cores share one or several cache levels.


Architecture of multicore processors (cont.)

Hierarchical design: multiple cores share multiple caches that are arranged in a tree-like structure.

3-level example:

- L1-cache in-core,
- 2 cores share an L2-cache,
- all cores have access to all of the L3 cache and memory.


Architecture of multicore processors (cont.)

- A crucial issue on multicore chips: data to be processed by the cores has to be transferred to the cores so fast that they do not have to wait for data.

- A powerful memory and I/O system is required.

- Today's memory systems use private L1-caches, and L2-caches that are shared among a few cores.

- For dozens of cores, another level of cache may be necessary.

- The I/O system must be able to transfer hundreds of GB/s between memory and processor.


Overview on various multicore processors

                     Intel        IBM       AMD       Sun
Processor            Core 2 Duo   Power 5   Opteron   T1
----------------------------------------------------------
Processor cores      2            2         2         8
Instructions/cycle   4            4         3         1
SMT                  no           yes       no        yes
L1-cache I/D         32/32        64/32     64/64     16/8
  (KB per core)
L2-cache             4 MB         1.9 MB    1 MB      3 MB
                     shared       shared    per core  shared
Clock rate (GHz)     2.66         1.9       2.4       1.2
Transistors          291 Mio      276 Mio   233 Mio   300 Mio
Power consumption    65 W         125 W     110 W     79 W

Ref.: Hennessy & Patterson


An example: the Intel Core 2 processor

Intel Core Duo processor floor plan

See Intel Technology Journal, Volume 10, Issue 2, 2006.


Caches

[Figure: generic processor with cache]


Caches on the Opteron dual-core

- L1-cache: 64 + 64 KB (data + instructions)
  - size of cache line: 256 B
  - three-cycle data load latency
  - two accesses per cycle, read or write
  - 8-way bank interleaved, two-way set associative

- L2-cache: 2 × 1024 KB

- Important: the main memory is accessed in chunks of cache lines. There is never a single number copied from memory.

  Principle of data locality: the safest assumption about the next data to be used is that they are the same as or nearby the last used.

- Problem with multi-cores: cache coherence.

- Remark: main memory reads from disk in chunks called pages.


Cache designs

- Direct mapped means a data block can go in only one place in the cache.

- Set associative means a block can go anywhere in a set. If there are m sets, the number of blocks in a set is

      n = (cache size in blocks) / m,

  and the cache is called an n-way set associative cache. (A worked example follows below.)
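A small worked instance (our numbers, purely illustrative): a 64 KB cache with 64 B lines holds 1024 blocks; organized 2-way set associative it has m = 512 sets, so n = 1024/512 = 2. The C sketch below maps an address to its set.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache geometry (for illustration only). */
#define CACHE_BYTES (64 * 1024)   /* 64 KB cache           */
#define LINE_BYTES  64            /* 64 B cache lines      */
#define WAYS        2             /* 2-way set associative */

#define NUM_BLOCKS  (CACHE_BYTES / LINE_BYTES)   /* 1024 blocks */
#define NUM_SETS    (NUM_BLOCKS / WAYS)          /*  512 sets   */

/* Map a memory address to its cache set: the block number modulo
   the number of sets. Within the set, the block may occupy any of
   the WAYS slots. */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void) {
    /* Two addresses exactly NUM_SETS * LINE_BYTES = 32 KB apart map
       to the same set and therefore compete for its 2 slots. */
    printf("set of 0x0000: %u\n", set_index(0x0000));
    printf("set of 0x8000: %u\n", set_index(0x8000));  /* same set */
    printf("set of 0x8040: %u\n", set_index(0x8040));  /* next set */
    return 0;
}
```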


Flynn’s Taxonomy of Parallel Systems

In the taxonomy of Flynn, parallel systems are classified according to the number of instruction streams and data streams.

M. Flynn: Proc. IEEE 54 (1966), pp. 1901-1909.


SISD: Single Instruction stream - Single Data stream

The classical von Neumann machine:

- processor: ALU, registers
- memory holds data and program
- bus (a collection of wires) = von Neumann bottleneck

Today's PCs or workstations are no longer true von Neumann machines:

- superscalar processors, parallel functional units
- pipelining
- memory interleaving (memory banks)


SIMD: Single Instruction stream - Multiple Data stream

During each instruction cycle the central control unit broadcasts an instruction to the subordinate processors, and each of them either executes the instruction or is idle. At any given time a processor is "active", executing exactly the same instruction as all other processors in a completely synchronous way, or it is idle.


Example SIMD machines

- Vector computers like the Cray-1, -2, X-MP, Y-MP, ... These machines pipelined operations with vectors of 64 floating point numbers.

- Streaming SIMD Extensions (SSE, SSE2, SSE3, SSE4): the Intel Pentium III and Pentium 4 have vector registers, also called multimedia or MMX registers, that support operations with short vectors. (A small intrinsics sketch follows below.)

- Graphics processing units (GPUs).

More on SIMD processors: next week.
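As a first taste of instruction-level SIMD (a minimal sketch we add here, using the SSE intrinsics from <xmmintrin.h>; compile with, e.g., gcc -msse): a single addps instruction performs four single-precision additions at once.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction: 4 additions at once */
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```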


MIMD: Multiple Instruction stream - Multiple Data stream

Each processor can execute its own instruction stream on its own data independently from the other processors. Each processor is a full-fledged CPU with both a control unit and an ALU. MIMD systems are asynchronous.


Memory organization

Most parallel machines are MIMD machines. MIMD machines are classified by their memory organization:

- shared memory machines (multiprocessors)
  - parallel processes, threads, OpenMP
  - communication by means of shared variables
  - data dependencies possible, race conditions
  - multi-core processors

- distributed memory machines (multicomputers)
  - all data are local to some processor
  - programmer responsible for data placement
  - communication by message passing, MPI
  - easy / cheap to build → Beowulf clusters (clusters built of commodity hardware)

Minimal sketches of the two programming models follow below.
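Two minimal sketches we add here (our illustrations, not from the slides) of the same task, summing a vector, one per model. Shared memory with OpenMP: the threads communicate through the shared array, and the reduction clause avoids the race condition on sum (compile with, e.g., gcc -fopenmp):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    /* Threads share a[]; reduction(+:sum) gives each thread a private
       partial sum and combines them, avoiding a race condition. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g (threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Distributed memory with MPI: every process owns only its local piece of the data, and the partial sums are combined explicitly by message passing, here with MPI_Reduce (compile with mpicc, run with, e.g., mpirun -np 4):

```c
#include <stdio.h>
#include <mpi.h>

#define NLOC 250000   /* local portion per process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process initializes and sums only its own local data. */
    static double a[NLOC];
    for (int i = 0; i < NLOC; i++) a[i] = 1.0;

    double local = 0.0, global = 0.0;
    for (int i = 0; i < NLOC; i++) local += a[i];

    /* Combine the partial sums on rank 0 by message passing. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %g over %d processes\n", global, size);
    MPI_Finalize();
    return 0;
}
```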


Interconnection network

Multiprocessors usually have a dynamic network, e.g., a crossbar switch, to connect processing elements and memory banks.

[Figure: crossbar switch with n processors and m memory modules; on the right, the possible switch states.]

Uniform access, scalable, but very many wires → very expensive; used for a limited number of processors only.


Interconnection network (cont.)

Multicomputers usually have static networks, like meshes, tori, hypercubes, or trees.

Processing elements are usually connected to the network through routers. Routers can pipeline messages.


Examples of MIMD machines


ETH HPC cluster Brutus

- Heterogeneous system with a total of 18'400 processor cores in ∼ 1000 compute nodes (March 2012).

- Jointly owned by ∼ 50 profs (shareholders) in 12 departments and IT Services.

- Nodes connected by a Gigabit Ethernet backbone and a high-speed InfiniBand QDR network.

- Each node has multiple multicore processors. Regular nodes: 1.5-2 GB RAM/node. Fat nodes: ≥ 8 GB RAM/node.

- Shared memory programming model on nodes.

- Distributed memory programming model across nodes.

- ccNUMA: Cache-coherent, Non-Uniform Memory Access.

- C/C++, Fortran, OpenMP, Open MPI.


Schematic view of Brutus

From http://www.clusterwiki.ethz.ch/brutus/Brutus_cluster


Supercomputers at CSCS

- CRAY XE6 – Monte Rosa
  - 1496 compute nodes, each with two 16-core AMD Opteron 6272 @ 2.1 GHz (Interlagos) processors
  - 47'872 compute cores
  - 46 TB DDR3 RAM
  - 290 TB disk
  - Gemini 3D torus interconnect


Supercomputers at CSCS (cont.)

- CRAY XK6 – Todi
  - GPU/CPU hybrid supercomputing system
  - 176 nodes, each equipped with
    - 16-core AMD Opteron CPUs,
    - 32 GB DDR3 memory,
    - one NVIDIA Tesla X2090 GPU with 6 GB memory
  - Rank 330 on the Top500 list (61 TFlop/s)


Cray SeaStar2 torus network

The interconnect router in the Cray SeaStar2+ chip provides six high-speed network links which connect to six neighbors in the 3D torus. The peak bidirectional bandwidth of each link is 9.6 GB/s, with sustained bandwidth in excess of 6 GB/s.


Cray Gemini 3D torus network

- Link bandwidths of 4.7 to 9.4 GB/s per direction (MPI sees only 2.9 to 5.8 GB/s).

- Each Gemini NIC has 10 network connections: two each in +X, -X, +Z, -Z, and one each in +Y and -Y.

- Internode latency is about 1.5 µs on a quiet network. Latency between cores connected to the same Gemini chip is < 1 µs. (A back-of-the-envelope message-cost model follows below.)
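These two figures feed the usual first-order model of message-passing cost, which we add here as a back-of-the-envelope aid (not from the slides): a message of n bytes takes about the latency α plus the transfer time n/β,

```latex
% First-order message cost model: latency alpha plus bandwidth term beta.
T(n) \approx \alpha + \frac{n}{\beta}
```

With α ≈ 1.5 µs and β ≈ 5.8 GB/s (the best rate MPI sees), a 1 MB message costs about 174 µs and is bandwidth-dominated, while messages below a few kB (the crossover is near α·β ≈ 9 kB) are latency-dominated.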
