Lecture on
Scientific Computing
Dr. Kersten Schmidt
Lecture 18
Technische Universität Berlin, Institut für Mathematik
Winter semester 2014/2015
Syllabus
- Linear regression, Fast Fourier transform
- Modelling by partial differential equations (PDEs)
  - Maxwell, Helmholtz, Poisson, linear elasticity, Navier-Stokes equations
  - boundary value problems, eigenvalue problems
  - boundary conditions (Dirichlet, Neumann, Robin)
  - handling of infinite domains (waveguide, homogeneous exterior: DtN, PML)
  - boundary integral equations
- Computer-aided design (CAD)
- Mesh generators
- Space discretisation of PDEs
  - Finite difference method
  - Finite element method
  - Discontinuous Galerkin finite element method
- Solvers
  - Linear solvers (direct, iterative), preconditioners
  - Nonlinear solvers (Newton-Raphson iteration)
  - Eigenvalue solvers
- Parallelisation
  - Computer hardware (SIMD, MIMD: shared/distributed memory)
  - Programming in parallel: OpenMP, MPI
Computer hardware
The central processing unit (CPU) – the processor – consists of
- the arithmetic logic unit (ALU), which performs arithmetic and logic operations,
- hardware registers, which supply operands to the ALU and store results,
- the control unit, which fetches instructions from main memory,
- a hierarchy of CPU caches (levels L1, L2, L3) for temporarily storing data that will be needed for the next instructions, and
- possibly an integrated graphics processor.

A processor may consist of several repetitions of these subunits, the cores, to obtain parallelisation.
Clock signal (German: Taktsignal)
- for synchronisation of the operations in the CPU and of fetching from and writing to memory
- the clock rate is the number of cycles per unit time, e.g. 3.6 GHz for the Intel Core i7-4790
- arithmetic and logic operations each take a certain number of cycles
- receiving data from main memory takes several cycles (latency); at 3.6 GHz one cycle lasts about 0.28 ns, while a main-memory access is on the order of 100 ns, i.e. hundreds of cycles

Vector processor or SIMD (single instruction, multiple data) extension
- performs the same operation on many similar data items at the same time, e.g. matrix operations
- streaming SIMD extensions in modern PC CPUs (from the Intel Pentium III and AMD Athlon XP on) with additional SIMD registers
- increases the number of instructions executed per cycle
Example: use of SIMD
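The illustration on this slide is not reproduced in the transcript. As a stand-in, here is a minimal C sketch of the idea (not from the lecture) using the SSE intrinsics mentioned above, available since the Pentium III; compile e.g. with gcc -msse. Four single-precision additions are performed by a single instruction.

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics (Pentium III and later) */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* Load four floats into 128-bit SIMD registers,
       add all four pairs with a single instruction,
       and store the four results back to memory. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; ++i)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}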
A variant of SIMD: a pipeline
- Complicated operations often take more than one cycle to complete, e.g. the multiplication of two integers takes 4 clock ticks.

Example: the element-by-element product $c_i = a_i b_i$ of two integer vectors of length $n$ (Hadamard product) takes $4n$ clock ticks.
Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004
Pipeline: split the operation into several stages that each take one cycle; then a pipeline can (after a startup phase) produce a result in each clock cycle.
Let the numbers $a_i$, $b_i$, $c_i$ be split into four fragments (bytes, little-endian),
$$a_i = [a_{i,3}, a_{i,2}, a_{i,1}, a_{i,0}], \qquad b_i = [b_{i,3}, b_{i,2}, b_{i,1}, b_{i,0}], \qquad c_i = [c_{i,3}, c_{i,2}, c_{i,1}, c_{i,0}],$$
then
$$c_{i,j} = a_{i,j}\, b_{i,j} + \text{carry from } a_{i,j-1}\, b_{i,j-1}.$$
Speed-up: without the pipeline the $n$ multiplications take $4n$ clock ticks; with the pipeline they take about $4 + n$ ticks (a startup phase of 4 cycles, then one result per cycle), so
$$S = \frac{4n}{4+n} \approx 4 \quad \text{for } n \gg 4.$$
For example, $n = 1000$ gives $S = 4000/1004 \approx 3.98$, close to the pipeline depth of 4.
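The counting can be checked with a few lines of C (not from the lecture; the idealised model above is assumed): the program tallies the clock ticks of the unpipelined and the pipelined variant and prints the resulting speed-up.

/* idealised tick counting for a 4-stage multiply pipeline */
#include <stdio.h>

int main(void)
{
    const long stages = 4;
    for (long n = 4; n <= 4096; n *= 4) {
        long t_seq  = stages * n;   /* 4 ticks per multiply, one multiply after another */
        long t_pipe = stages + n;   /* startup phase, then one result per tick */
        printf("n = %4ld:  S = %ld / %ld = %.2f\n",
               n, t_seq, t_pipe, (double)t_seq / (double)t_pipe);
    }
    return 0;
}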
Computer memory
- primary (RAM, German: Arbeitsspeicher)
  - stores the program (instructions) to run and the data to work on (concept by von Neumann, Princeton)
  - loses its data if the device is powered down
- secondary
  - does not lose its data if the device is powered down
  - examples: flash memory (e.g. solid state drives, SSD), magnetic discs (hard and floppy disks), optical discs (e.g. CD-ROM)
- cache (as part of the CPU)
  - temporarily stores data that will be needed for the next instructions (the experiment below makes the effect visible)
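The effect of the cache hierarchy can be made visible with a small experiment. The following C sketch (not from the lecture; the array size and the strides are arbitrary choices) performs the same total number of additions for every stride, but large strides touch a new cache line on almost every access; on most machines the large-stride runs are noticeably slower.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints, far larger than the caches */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; ++i) a[i] = 1;

    for (int stride = 1; stride <= 64; stride *= 8) {
        long sum = 0;
        clock_t t0 = clock();
        /* the same number of additions for every stride,
           only the memory access pattern changes */
        for (int k = 0; k < stride; ++k)
            for (int i = k; i < N; i += stride)
                sum += a[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2d: sum = %ld, time = %.3f s\n", stride, sum, sec);
    }
    free(a);
    return 0;
}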
Prefetch data into the cache that will be needed for further instructions.

Without prefetching, the processor stalls periodically while waiting for data to be retrieved from main memory into the cache or into a processor register. Prefetching data before the processor completes the previous task eliminates these stall times. When the computations are fast, a stall time remains even with prefetching, but it is shorter.
Example: application of a function $f$ to each component $a_i$ of a vector, $b_i = f(a_i)$.
Petersen and Arbenz, “Introduction to Parallel Computing”, Oxford University Press, 2004
Simple loop
for(i = 0; i < n; ++i)
b[i] = f(a[i]);
with prefetching (hiding the next load of a[i] under the loop overhead)
t = a[0]; /* prefetch a[0] */
for(i = 0; i < n-1; ) {
b[i] = f(t);
t = a[++i]; /* prefetch a[i+1] */
}
b[n-1] = f(t);
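As a self-contained check (not from the lecture; the function f, the vector length and the data are arbitrary stand-ins), the following program runs both loops and verifies that they compute the same result; timing them is left to the reader, and on modern CPUs with hardware prefetchers the difference may be small anyway.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static float f(float x) { return sqrtf(x) + 1.0f; }  /* arbitrary stand-in */

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    float *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < N; ++i) a[i] = (float)i;

    /* simple loop */
    for (int i = 0; i < N; ++i)
        b[i] = f(a[i]);

    /* software-prefetching variant from the slide */
    float t = a[0];
    for (int i = 0; i < N - 1; ) {
        c[i] = f(t);
        t = a[++i];
    }
    c[N - 1] = f(t);

    /* both variants must agree exactly */
    for (int i = 0; i < N; ++i)
        if (b[i] != c[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("results identical\n");
    free(a); free(b); free(c);
    return 0;
}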
Moore’s law: the observation of an exponential growth of the speed of integrated circuits / processors
- the number of transistors per square centimetre doubles every 18 to 24 months
- but memory access times have not improved accordingly; memory performance doubles only every 6 years (hence the hierarchy of caches)

“Transistor Count and Moore’s Law – 2011” by Wgsimon, own work (Wikimedia Commons)
Overview of Parallel Computing Hardware
Parallel computing: distribute the instructions over many processors to decrease the overall computation time.

Parallel systems can be classified according to the number of instruction streams and data streams.
M. Flynn, Proc. IEEE 54 (1966), 1901–1909.
SISD: single instruction stream – single data stream
- the classical von Neumann machine

SIMD: single instruction stream – multiple data streams
- during each instruction cycle the central control unit broadcasts an instruction to the processors, and each of them either executes the instruction or is idle
- at any given time a processor either executes exactly the same instruction as all other processors, in a completely synchronous way, or is idle
MIMD: multiple instruction streams – multiple data streams
- each processor can execute its own instruction stream on its own data, independently of the other processors
- each processor is a full-fledged (German: vollwertig) CPU with control unit and ALU
- MIMD machines are asynchronous; each processor can execute its own program

Generic distributed memory computer, e.g. a cluster.
Generic shared memory computer, e.g. a compute server.
Clusters with partly distributed and partly shared memory

Clusters / compute servers at the Institut für Mathematik, TU Berlin: http://www.math.tu-berlin.de/iuk/computeserver
- AMD clusters, Intel clusters, GPU clusters, an IBM Cell processor cluster
- a batch system (German: Stapelverarbeitung) to submit (parallel) jobs
Parallelisation
Processes
- A process is an instance of a program that is executing more or less autonomously on a physical processor.
- A program is parallel if, at any time during its execution, it comprises more than one process.

Shared-memory and distributed-memory programs differ in how processes communicate with each other:
- In shared-memory programs communication is through variables that are shared by all the involved processes (see the sketch after this list).
- In distributed-memory programs processes communicate by sending and receiving messages.
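As an illustration of the shared-memory model, here is a minimal OpenMP sketch in C (not from the lecture; the vector length and the data are arbitrary): all threads read the shared arrays, and the shared variable sum is updated safely through a reduction. A distributed-memory counterpart using MPI messages is sketched at the end of the section.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;                /* shared by all threads */

    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    /* each thread works on part of the index range; the partial
       sums are combined into the shared variable 'sum' */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i] * b[i];

    printf("dot product = %g\n", sum);
    return 0;
}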
Sequential algorithms
- To estimate the run-time of a sequential algorithm we count the number of floating point operations (flops); sometimes it may be better to count the number of memory accesses, but that does not change the picture in comparison with parallel algorithms:
$$T_{\text{seq}} = (\text{number of flops}) \times t_{\text{flop}}$$

Example: the dot product of two $n$-vectors needs $n$ multiplications and $n-1$ additions,
$$\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i \quad \Rightarrow \quad T_{\text{seq}} = (2n-1)\, t_{\text{flop}}$$
Parallel algorithm – what does execution time mean on a parallel processor?
- on a distributed arrangement of processors there is no common or synchronised clock
- one may choose the maximal execution time of the program over the processors involved in the computation
- if we measure time in parallel, we do this on one processor (P0) – the master processor
- processor P0 usually reads the input data and outputs the computed results
- for theoretical considerations we simply assume that all processors (say p of them) start at the same moment
- the execution time T(p) then is the period of time from this moment until the last of the p processors finishes its computation
- T(1) is the execution time of the best sequential algorithm
See the paper by David Bailey in Supercomputing Review (1991), which shows how timings can be manipulated in order to claim successful parallelisation of an algorithm.
Speedup – measures the gain in (wall-clock) time that is obtained by parallel execution of a program; a minimal measurement is sketched below.
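The following is a minimal MPI sketch in C, not from the lecture: the splitting of the vector, the data initialisation and the printed quantities are illustrative choices. Each of the p processes computes a partial dot product, the partial results are combined on the master processor P0 by message passing, and T(p) is taken as the maximal execution time over all processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1000000;                /* global vector length */
    long local = n / p + (rank < n % p);   /* this process's share of the indices */

    /* every process fills its own piece; a real code would scatter
       data that the master P0 has read */
    double *x = malloc(local * sizeof *x);
    double *y = malloc(local * sizeof *y);
    for (long i = 0; i < local; ++i) { x[i] = 1.0; y[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);           /* all processors start at the same moment */
    double t0 = MPI_Wtime();

    double partial = 0.0, dot = 0.0;
    for (long i = 0; i < local; ++i)
        partial += x[i] * y[i];
    /* combine the partial results on the master processor P0 */
    MPI_Reduce(&partial, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* T(p): the maximal execution time over the p processes */
    double t = MPI_Wtime() - t0, tmax;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("dot = %g, T(%d) = %g s\n", dot, p, tmax);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}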