Post on 12-Jan-2016
Parallel Computing Through MPI Technologies
Author: Nyameko Lisa
Supervisors: Prof. Elena Zemlyanaya, Prof. Alexandr P. Sapozhnikov and Tatiana F. Sapozhnikov
Outline – Parallel Computing through MPI Technologies
– Introduction
– Overview of MPI
– General Implementation
– Examples
– Application to Physics Problems
– Concluding Remarks
Introduction – Need for Parallelism
There are more stars in the sky than there are grains of sand on all the beaches of the world
Introduction – Need for Parallelism
It requires approximately 204 billion atoms to encode the human genome sequence
A vast number of problems from a wide range of fields have significant computational requirements
Introduction – Aim of Parallelism
Attempt to divide a single problem into multiple parts
Distribute the segments of said problem amongst various processes or nodes
Provide a platform layer to manage data exchange between multiple processes that solve a common problem simultaneously
Introduction – Serial Computation
The problem is divided into a discrete, serial sequence of instructions
Each instruction is executed individually, on a single CPU
Introduction – Parallel Computation
The same problem is distributed amongst several processes (program plus allocated data)
Introduction – Implementation
Main goal is to save time, and hence money
– Furthermore, can solve larger problems that exhaust the resources of a single machine
– Overcome intrinsic limitations of serial computation
– Distributed systems provide redundancy, concurrency and access to non-local resources, e.g. SETI, Facebook, etc.
Three methodologies for the implementation of parallelism:
– Physical architecture
– Framework
– Algorithm
In practice, it will almost always be a combination of the above
The greatest hurdle is managing the distribution of information and data exchange, i.e. overhead
Introduction – Top 500
Japan’s K Computer (kei = 10 quadrillion)
Currently the fastest supercomputer cluster in the world: 8.162 petaflops (~8 × 10^15 calculations per second)
Overview – What is MPI?
Message Passing Interface
One of many frameworks and technologies for implementing parallelization
A library of subroutines (FORTRAN), classes (C/C++) and Python bindings that mediates communication (via messages) between single-threaded processes executing independently and in parallel
Overview – What is needed?
Common user accounts with the same password
Administrator / root privileges for all accounts
Common directory structure and paths
MPICH2 installed on all machines – a combination of the MPI-1 and MPI-2 standards
CH – the Chameleon portability layer, which provides backward compatibility with existing MPI frameworks
Overview – What is needed?
MPICC & MPIF77 – provide the options and special libraries needed to compile and link MPI programs
MPIEXEC – initializes parallel jobs and spawns copies of the executable to all of the processes
Each process executes its own copy of the code
By convention, the root process (rank 0) is chosen to serve as the master process
General Implementation – Hello World: C++
General Implementation – Hello World: FORTRAN
General Implementation – Hello World: Output
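The code listings from these slides did not survive conversion. A minimal C++ sketch of what an MPI "Hello World" typically looks like (using the standard C bindings; the exact listing shown in the talk may have differed) is:

```cpp
// Minimal MPI "Hello World" sketch (standard C bindings from <mpi.h>).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                    // start the MPI runtime

    int rank = 0, numProcs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's id (0 = root)
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);  // total number of processes

    std::printf("Hello world from process %d of %d\n", rank, numProcs);

    MPI_Finalize();                            // shut the runtime down
    return 0;
}
```

Compiled with the MPICC wrapper and launched with MPIEXEC (e.g. `mpiexec -n 4 ./hello`), each of the four spawned copies prints its own rank; the order of the output lines is not deterministic.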
Example - Broadcast Routine
Point-to-point (send & recv) and collective (bcast) routines are provided by the MPI library
Source node mediates distribution of data to/from all other nodes
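As a sketch of the collective case (buffer contents and sizes here are illustrative, not from the slides), the root process can distribute a buffer to every other node with a single `MPI_Bcast` call:

```cpp
// Sketch: the root (rank 0) broadcasts an array to every process.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {                 // only the source node fills the buffer
        data[0] = 3.14; data[1] = 2.71; data[2] = 1.41; data[3] = 1.61;
    }

    // Every process calls MPI_Bcast; afterwards each one's buffer
    // holds the root's values.
    MPI_Bcast(data, 4, MPI_DOUBLE, /*root=*/0, MPI_COMM_WORLD);

    std::printf("process %d got %.2f ... %.2f\n", rank, data[0], data[3]);
    MPI_Finalize();
    return 0;
}
```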
Example – Broadcast Routine: Linear Case
Apart from the root and last nodes, each node receives from the previous node and sends to the next node
Use point-to-point library routines to build custom collective routine
MPI_RECV(myProc - 1)
MPI_SEND(myProc + 1)
Example – Broadcast Routine: Binary Tree
Each parent node sends message to two child nodes
MPI_SEND(2 * myProc)
MPI_SEND(2 * myProc + 1)
IF( MOD(myProc, 2) == 0 )
    MPI_RECV( myProc/2 )
ELSE
    MPI_RECV( (myProc-1)/2 )
Example – Broadcast Routine: Output
Applications to Physics Problems
Quadrature – discretize the interval [a,b] into N steps and divide them amongst the processes:
– FOR LOOP (from 1 + myProc to N, in increments of numProcs)
– e.g. with N = 10 and numProcs = 3:
Process: Iteration 1, Iteration 2, …
0: 1, 4, 7, 10
1: 2, 5, 8
2: 3, 6, 9
Finite Difference problems – Similarly divide mesh/grid amongst processes
Many applications, limited only by our ingenuity
Closing Remarks
In the 1970s, Intel co-founder Gordon Moore correctly predicted that the ”number of transistors that can be inexpensively placed on an integrated circuit doubles approximately every 2 years”
10-Core Xeon E7 processor family chips are currently commercially available
MPI is easy to implement and well suited to many independent operations that can be executed simultaneously
The only limitations are the overhead incurred by inter-process communication, our ingenuity, and strictly sequential segments of the program
Acknowledgements and Thanks
NRF and South African Department of Science and Technology
JINR, University Centre
Dr. Jacobs and Prof. Lekala
Prof. Elena Zemlyanaya, Prof. Alexandr P. Sapozhnikov and Tatiana F. Sapozhnikov
Last but not least, my fellow colleagues