Lecture on
Scientific Computing
Dr. Kersten Schmidt
Lecture 18
Technische Universität Berlin, Institut für Mathematik
Winter semester 2014/2015
Syllabus
- Linear regression, Fast Fourier transform
- Modelling by partial differential equations (PDEs)
  - Maxwell, Helmholtz, Poisson, linear elasticity, Navier-Stokes equations
  - boundary value problems, eigenvalue problems
  - boundary conditions (Dirichlet, Neumann, Robin)
  - handling of infinite domains (waveguide, homogeneous exterior: DtN, PML)
  - boundary integral equations
- Computer-aided design (CAD)
- Mesh generators
- Space discretisation of PDEs
  - Finite difference method
  - Finite element method
  - Discontinuous Galerkin finite element method
- Solvers
  - Linear solvers (direct, iterative), preconditioners
  - Nonlinear solvers (Newton-Raphson iteration)
  - Eigenvalue solvers
- Parallelisation
  - Computer hardware (SIMD, MIMD: shared/distributed memory)
  - Programming in parallel: OpenMP, MPI
Computer hardware
The central processing unit (CPU) – the processor – consists of
- the arithmetic logic unit (ALU), which performs arithmetic and logic operations,
- hardware registers, which supply operands to the ALU and store results,
- the control unit, which fetches instructions from main memory,
- a hierarchy of CPU caches (levels L1, L2, L3) for temporarily storing data that will be needed for the next instructions, and
- possibly an integrated graphics processor.

A processor may consist of several repetitions of these subunits, the cores, to obtain parallelisation.
Clock signal (German: Taktsignal)
- for synchronisation of the operations in the CPU and of fetching from and writing to memory
- the clock rate is the number of cycles per unit time, e.g. 3.6 GHz for the Intel Core i7-4790
- arithmetic and logic operations each take a certain number of cycles
- receiving data from main memory takes several cycles (latency); at 3.6 GHz one cycle lasts about 0.28 ns, while a main-memory access is on the order of 100 ns, i.e. hundreds of cycles

Vector processor or SIMD (single instruction, multiple data) extension
- performs the same operation on many similar data items at the same time, e.g. matrix operations
- streaming SIMD extensions in modern PC CPUs (from the Intel Pentium III and AMD Athlon XP on) with additional SIMD registers
- increases the number of instructions executed per cycle
Example: use of SIMD
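The illustration on this slide is not reproduced in the transcript. As a stand-in, here is a minimal C sketch of the idea (not from the lecture) using the SSE intrinsics mentioned above, available since the Pentium III; compile e.g. with gcc -msse. Four single-precision additions are performed by a single instruction.

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics (Pentium III and later) */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* Load four floats into 128-bit SIMD registers,
       add all four pairs with a single instruction,
       and store the four results back to memory. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; ++i)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}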
A variant of SIMD: a pipeline
- Complicated operations often take more than one cycle to complete, e.g. the multiplication of two integers takes 4 clock ticks.

Example: the element-by-element product $c_i = a_i b_i$ of two integer vectors of length $n$ (Hadamard product) takes $4n$ clock ticks.
Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004
Pipeline: split the operation into several stages that each take one cycle; then a pipeline can (after a startup phase) produce a result in each clock cycle.
Let the numbers $a_i$, $b_i$, $c_i$ be split into four fragments (bytes, little-endian),
$$a_i = [a_{i,3}, a_{i,2}, a_{i,1}, a_{i,0}], \qquad b_i = [b_{i,3}, b_{i,2}, b_{i,1}, b_{i,0}], \qquad c_i = [c_{i,3}, c_{i,2}, c_{i,1}, c_{i,0}],$$
then
$$c_{i,j} = a_{i,j}\, b_{i,j} + \text{carry from } a_{i,j-1}\, b_{i,j-1}.$$
Speed-up: without the pipeline the $n$ multiplications take $4n$ clock ticks; with the pipeline they take about $4 + n$ ticks (a startup phase of 4 cycles, then one result per cycle), so
$$S = \frac{4n}{4+n} \approx 4 \quad \text{for } n \gg 4.$$
For example, $n = 1000$ gives $S = 4000/1004 \approx 3.98$, close to the pipeline depth of 4.
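The counting can be checked with a few lines of C (not from the lecture; the idealised model above is assumed): the program tallies the clock ticks of the unpipelined and the pipelined variant and prints the resulting speed-up.

/* idealised tick counting for a 4-stage multiply pipeline */
#include <stdio.h>

int main(void)
{
    const long stages = 4;
    for (long n = 4; n <= 4096; n *= 4) {
        long t_seq  = stages * n;   /* 4 ticks per multiply, one multiply after another */
        long t_pipe = stages + n;   /* startup phase, then one result per tick */
        printf("n = %4ld:  S = %ld / %ld = %.2f\n",
               n, t_seq, t_pipe, (double)t_seq / (double)t_pipe);
    }
    return 0;
}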
Computer memory
- primary (RAM, German: Arbeitsspeicher)
  - stores the program (instructions) to run and the data to work on (concept by von Neumann, Princeton)
  - loses its data if the device is powered down
- secondary
  - does not lose its data if the device is powered down
  - examples: flash memory (e.g. solid state drives, SSD), magnetic discs (hard and floppy disks), optical discs (e.g. CD-ROM)
- cache (as part of the CPU)
  - temporarily stores data that will be needed for the next instructions (the experiment below makes the effect visible)
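The effect of the cache hierarchy can be made visible with a small experiment. The following C sketch (not from the lecture; the array size and the strides are arbitrary choices) performs the same total number of additions for every stride, but large strides touch a new cache line on almost every access; on most machines the large-stride runs are noticeably slower.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints, far larger than the caches */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; ++i) a[i] = 1;

    for (int stride = 1; stride <= 64; stride *= 8) {
        long sum = 0;
        clock_t t0 = clock();
        /* the same number of additions for every stride,
           only the memory access pattern changes */
        for (int k = 0; k < stride; ++k)
            for (int i = k; i < N; i += stride)
                sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2d: sum = %ld, time = %.3f s\n", stride, sum, sec);
    }
    free(a);
    return 0;
}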
Prefetch data into the cache that will be needed for further instructions.

Without prefetching, the processor stalls periodically while waiting for data to be retrieved from main memory into the cache or into a processor register. Prefetching data before the processor completes the previous task eliminates these stall times. When the computations are fast, a stall time remains even with prefetching, but it is shorter.
Example: application of a function $f$ to each component $a_i$ of a vector, $b_i = f(a_i)$.
Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004
Simple loop
for(i = 0; i < n; ++i)
b[i] = f(a[i]);
with prefetching (hiding the next load of a[i] under the loop overhead)
t = a[0]; /* prefetch a[0] */
for(i = 0; i < n-1; ) {
b[i] = f(t);
t = a[++i]; /* prefetch a[i+1] */
}
b[n-1] = f(t);
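As a self-contained check (not from the lecture; the function f, the vector length and the data are arbitrary stand-ins), the following program runs both loops and verifies that they compute the same result; timing them is left to the reader, and on modern CPUs with hardware prefetchers the difference may be small anyway.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static float f(float x) { return sqrtf(x) + 1.0f; }  /* arbitrary stand-in */

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < N; ++i) a[i] = (float)i;

    /* simple loop */
    for (int i = 0; i < N; ++i)
        b[i] = f(a[i]);

    /* software-prefetching variant from the slide */
    float t = a[0];
    for (int i = 0; i < N - 1; ) {
        c[i] = f(t);
        t = a[++i];
    }
    c[N - 1] = f(t);

    /* both variants must agree exactly */
    for (int i = 0; i < N; ++i)
        if (b[i] != c[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("results identical\n");
    free(a); free(b); free(c);
    return 0;
}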
Moore’s law: the observation of an exponential growth of the speed of integrated circuits / processors
- the number of transistors per square centimetre doubles every 18 to 24 months
- but memory access times have not improved accordingly; memory performance doubles only every 6 years (hence the hierarchy of caches)

“Transistor Count and Moore’s Law – 2011” by Wgsimon, own work (Wikimedia Commons)
Overview of Parallel Computing Hardware
Parallel computing: distribute the instructions over many processors to decrease the overall computation time.

Parallel systems can be classified according to the number of instruction streams and data streams.
M. Flynn, Proc. IEEE 54 (1966), 1901–1909.
SISD: single instruction stream – single data stream
- the classical von Neumann machine

SIMD: single instruction stream – multiple data streams
- during each instruction cycle the central control unit broadcasts an instruction to the processors, and each of them either executes the instruction or is idle
- at any given time a processor either executes exactly the same instruction as all other processors, in a completely synchronous way, or is idle
MIMD: multiple instruction streams – multiple data streams
- each processor can execute its own instruction stream on its own data, independently of the other processors
- each processor is a full-fledged (German: vollwertig) CPU with control unit and ALU
- MIMD machines are asynchronous; each processor can execute its own program

Generic distributed memory computer, e.g. a cluster.
Generic shared memory computer, e.g. a compute server.
Clusters with partly distributed and partly shared memory

Clusters / compute servers at the Institut für Mathematik, TU Berlin: http://www.math.tu-berlin.de/iuk/computeserver
- AMD clusters, Intel clusters, GPU clusters, an IBM Cell processor cluster
- a batch system (German: Stapelverarbeitung) to submit (parallel) jobs
Parallelisation
Processes
- A process is an instance of a program that is executing more or less autonomously on a physical processor.
- A program is parallel if, at any time during its execution, it comprises more than one process.

Shared-memory and distributed-memory programs differ in how processes communicate with each other:
- In shared-memory programs communication is through variables that are shared by all the involved processes (see the sketch after this list).
- In distributed-memory programs processes communicate by sending and receiving messages.
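As an illustration of the shared-memory model, here is a minimal OpenMP sketch in C (not from the lecture; the vector length and the data are arbitrary): all threads read the shared arrays, and the shared variable sum is updated safely through a reduction. A distributed-memory counterpart using MPI messages is sketched at the end of the section.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;                /* shared by all threads */

    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    /* each thread works on part of the index range; the partial
       sums are combined into the shared variable 'sum' */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i] * b[i];

    printf("dot product = %g\n", sum);
    return 0;
}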
Sequential algorithms
- To estimate the run-time of a sequential algorithm we count the number of floating point operations (flops); sometimes it may be better to count the number of memory accesses, but that does not change the picture in comparison with parallel algorithms:
$$T_{\text{seq}} = (\text{number of flops}) \times t_{\text{flop}}$$

Example: the dot product of two $n$-vectors needs $n$ multiplications and $n-1$ additions,
$$\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i \quad \Rightarrow \quad T_{\text{seq}} = (2n-1)\, t_{\text{flop}}$$
Parallel algorithm – what does execution time mean on a parallel processor?
- on a distributed arrangement of processors there is no common or synchronised clock
- one may choose the maximal execution time of the program over the processors involved in the computation
- if we measure time in parallel, we do this on one processor (P0) – the master processor
- processor P0 usually reads the input data and outputs the computed results
- for theoretical considerations we simply assume that all processors (say p of them) start at the same moment
- the execution time T(p) then is the period of time from this moment until the last of the p processors finishes its computation
- T(1) is the execution time of the best sequential algorithm
See the paper by David Bailey in Supercomputing Review (1991), which shows how timings can be manipulated in order to claim successful parallelisation of an algorithm.
Speedup – measures the gain in (wall-clock) time that is obtained by parallel execution of a program; a minimal measurement is sketched below.
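The following is a minimal MPI sketch in C, not from the lecture: the splitting of the vector, the data initialisation and the printed quantities are illustrative choices. Each of the p processes computes a partial dot product, the partial results are combined on the master processor P0 by message passing, and T(p) is taken as the maximal execution time over all processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1000000;                /* global vector length */
    long local = n / p + (rank < n % p);   /* this process's share of the indices */

    /* every process fills its own piece; a real code would scatter
       data that the master P0 has read */
    double *x = malloc(local * sizeof *x);
    double *y = malloc(local * sizeof *y);
    for (long i = 0; i < local; ++i) { x[i] = 1.0; y[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);           /* all processors start at the same moment */
    double t0 = MPI_Wtime();

    double partial = 0.0, dot = 0.0;
    for (long i = 0; i < local; ++i)
        partial += x[i] * y[i];
    /* combine the partial results on the master processor P0 */
    MPI_Reduce(&partial, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* T(p): the maximal execution time over the p processes */
    double t = MPI_Wtime() - t0, tmax;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("dot = %g, T(%d) = %g s\n", dot, p, tmax);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}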