High Performance Computing - University of Iceland · 8/23/2017 · Part Two – Introduction to...
ADVANCED SCIENTIFIC COMPUTING
Dr.-Ing. Morris Riedel, Adjunct Associate Professor, School of Engineering and Natural Sciences, University of Iceland; Research Group Leader, Juelich Supercomputing Centre, Germany
Introduction to High Performance Computing, August 23, 2017, Room TG-227
High Performance Computing
Part Two
Outline
High Performance Computing (HPC) Basics
- Four basic building blocks of HPC
- TOP500 and Performance Benchmarks
- Shared Memory and Distributed Memory Architectures
- Hybrid and Emerging Architectures

HPC Ecosystem Technologies
- Software Environments & Scheduling
- System Architectures & Network Topologies
- Data Access & Large-scale Infrastructures

Parallel Programming Basics
- Message Passing Interface (MPI)
- OpenMP
- GPGPUs
- Selected Programming Challenges
High Performance Computing (HPC) Basics
What is High Performance Computing?
Wikipedia redirects from ‘HPC’ to ‘Supercomputer’, which already gives a hint of what it is generally about:
HPC includes work on ‘four basic building blocks’ in this course:
- Theory (numerical laws, physical models, speed-up performance, etc.)
- Technology (multi-core, supercomputers, networks, storages, etc.)
- Architecture (shared-memory, distributed-memory, interconnects, etc.)
- Software (libraries, schedulers, monitoring, applications, etc.)
A supercomputer is a computer at the frontline of contemporary processing capacity, particularly speed of calculation.
[1] Wikipedia ‘Supercomputer’ Online
[2] Introduction to High Performance Computing for Scientists and Engineers
(Figure: HTC systems, where the network interconnection is less important)
HPC vs. High Throughput Computing (HTC) Systems
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high performance cpu/core interconnections. These are compute-oriented systems.
High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of ‘farming jobs’ without providing a high performance interconnection between the cpu/cores. These are data-oriented systems.
(Figure: HPC systems, where the network interconnection is important)
Parallel Computing
All modern supercomputers depend heavily on parallelism
Often known as ‘parallel processing’ of some problem space: tackle problems in parallel to enable the ‘best performance’ possible.
‘The measure of speed’ in High Performance Computing matters: a common measure for parallel computers is established by the TOP500 list, based on a benchmark for ranking the best 500 computers worldwide.
We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way
[2] Introduction to High Performance Computing for Scientists and Engineers
[3] TOP 500 supercomputing sites
TOP 500 List (June 2017)
(Figure: TOP500 list excerpt; annotations: power challenge, EU #1)
[3] TOP 500 supercomputing sites
LINPACK benchmarks and Alternatives
TOP500 ranking is based on the LINPACK benchmark
LINPACK covers only a single architectural aspect (‘critics exist’); it measures ‘peak performance’: all involved ‘supercomputer elements’ operate at maximum performance. It is available through a wide variety of ‘open source implementations’; its success via ‘simplicity & ease of use’ has kept it in use for over two decades.
Realistic application benchmark suites might be alternatives:
- HPC Challenge benchmarks (includes 7 tests)
- JUBE benchmark suite (based on real applications)
LINPACK solves a dense system of linear equations of unspecified size.
[4] LINPACK Benchmark implementation
[5] HPC Challenge Benchmark Suite
[6] JUBE Benchmark Suite
The top 10 systems in the TOP500 list are dominated by companies, e.g. IBM, CRAY, Fujitsu, etc.
Dominant Architectures of HPC Systems
Traditionally two dominant types of architectures:
- Shared-Memory Computers
- Distributed-Memory Computers
Often hierarchical (hybrid) systems of both in practice. The community has been dominated in the last couple of years by X86-based commodity clusters running the Linux OS on Intel/AMD processors.
More recently, both of the above are considered as ‘programming models’:
- Shared-memory parallelization with OpenMP
- Distributed-memory parallel programming with MPI
Shared-Memory Computers
Two varieties of shared-memory systems:
1. Unified Memory Access (UMA)
2. Cache-coherent Nonuniform Memory Access (ccNUMA)
The problem of ‘cache coherence’ (in UMA/ccNUMA): different CPUs use caches to modify the same data items, so consistency between cached data and data in memory must be guaranteed. ‘Cache coherence protocols’ ensure a consistent view of memory.
A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space
[2] Introduction to High Performance Computing for Scientists and Engineers
Shared-Memory with UMA
A socket is a physical package (with multiple cores), typically a replaceable component.
Two dual-core chips (2 cores/socket):
- P = Processor core
- L1D = Level 1 Cache – Data (fastest)
- L2 = Level 2 Cache (fast)
- Memory = main memory (slow)
- Chipset = enforces cache coherence and mediates connections to memory
UMA systems use ‘flat memory model’: Latencies and bandwidth are the same for all processors and all memory locations.
Also called Symmetric Multiprocessing (SMP)
[2] Introduction to High Performance Computing for Scientists and Engineers
Shared-Memory with ccNUMA
Eight cores (4 cores/socket); L3 = Level 3 Cache. The memory interface establishes a coherent link to enable one ‘logical’ single address space over ‘physically distributed memory’.
ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems).
Network logic makes the aggregated memory appear as one single address space
[2] Introduction to High Performance Computing for Scientists and Engineers
Programming with Shared Memory using OpenMP
OpenMP is a set of compiler directives to ‘mark parallel regions’. Bindings are defined for the C, C++, and Fortran languages. Threads TX are ‘lightweight processes’ that mutually access data.
(Figure: threads T1 to T5 accessing one shared memory)
Shared-memory programming enables immediate access to all data from all processors without explicit communication.
OpenMP is the dominant shared-memory programming standard today (v3).
[7] OpenMP API Specification
Distributed-Memory Computers
Processors communicate via Network Interfaces (NI). The NI mediates the connection to a communication network. This setup is rarely used from a programming model view today.
A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly
[2] Introduction to High Performance Computing for Scientists and Engineers
Programming with Distributed Memory using MPI
There is no remote memory access on distributed-memory systems; they require ‘sending messages’ back and forth between processes PX.
Many free Message Passing Interface (MPI) libraries are available. Programming is tedious & complicated, but it is the most flexible method.
(Figure: processes P1 to P5, each with local memory, connected by a network)
Distributed-memory programming enables explicit message passing as communication between processors
MPI is dominant distributed-memory programming standard today (v2.2)
[8] MPI Standard
Hierarchical Hybrid Computers
Shared-memory nodes (here ccNUMA) with local NIs; the NI mediates connections to other remote ‘SMP nodes’.
A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system but a mixture of both
Large-scale ‘hybrid’ parallel computers have shared-memory building blocks interconnected with a fast network today
[2] Introduction to High Performance Computing for Scientists and Engineers
Programming Hybrid Systems
Experience from HPC practice: most parallel applications still take no notice of the hardware structure; the use of pure MPI for parallelization remains the dominant programming model (historical reason: old supercomputers were all of the distributed-memory type).
Challenges with the ‘mapping problem’: the performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model; it largely depends on the association of threads and processes to cores.
Emerging ‘hybrid programming models‘ using GPGPUs and CPUs
Hybrid systems programming uses MPI as explicit internode communication and OpenMP for parallelization within the node
Emerging HPC System Architecture Developments
Increasing number of other ‘new’ emerging system architectures
General Purpose Computation on Graphics Processing Units (GPGPUs): use of GPUs for computing instead of computer graphics. Programming models are OpenCL and NVIDIA CUDA. Getting more and more adopted in many application fields.
Field Programmable Gate Arrays (FPGAs): integrated circuits designed to be configured by a user after shipping; they enable updates of functionality and reconfigurable ‘wired’ interconnects.
Cell processors: enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computations.
Often in state of flux/vendor-specific More details are quickly outdated Not tackled in depth in this course
HPC Ecosystem Technologies
HPC Software Environment
Operating System: in former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’.
Scheduling Systems: manage concurrent access of users on supercomputers; different scheduling algorithms can be used with different ‘batch queues’. Examples: SLURM @ JOTUNN cluster, LoadLeveler @ JUQUEEN, etc.
Monitoring Systems: monitor and test the status of the system (‘system health checks/heartbeat’); enable a view of the usage of the system per node/rack (‘system load’). Examples: LLView, INCA, Ganglia @ JOTUNN cluster, etc.
Performance Analysis Systems: measure the performance of an application and recommend improvements. Examples: SCALASCA, VAMPIR, etc.
HPC systems typically provide a software environment that supports the processing of parallel applications.
Example: Ganglia @ Jötunn Cluster
[9] Jötunn Cluster Ganglia Monitoring Online
Scheduling Principles
HPC systems are typically not used in an interactive fashion: a program application starts ‘processes’ on processors (‘do a job for a user’), and users of HPC systems send ‘job scripts’ to schedulers to start programs. Scheduling enables the sharing of the HPC system with other users. It is closely related to operating systems, with a wide variety of algorithms.
E.g. First Come First Serve (FCFS): queues processes in the order that they arrive in the ready queue.
E.g. Backfilling: enables maximizing cluster utilization and throughput. The scheduler searches for jobs that can fill gaps in the schedule: smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but that job must not be delayed by backfilling!).
Scheduling is the method by which user processes are given access to processor time (shared)
Example: Concurrent Usage of a Supercomputer
[10] LLView Tool
System Architectures
HPC systems are very complex ‘machines’ with many elements:
- CPUs & multi-cores with ‘multi-threading’ capabilities
- Data access levels with different levels of caches
- Network topologies and various interconnects (example: IBM Bluegene/Q)
HPC faced a significant change in practice with respect to performance increases: getting more speed for free by waiting for new CPU generations does not work any more. Multicore processors emerged that require using those multiple resources efficiently in parallel.
Example: Supercomputer BlueGene/Q
[10] LLView Tool
Multi-core CPU Processors
Significant advances in CPUs (or microprocessor chips): multi-core architectures with dual, quad, six, or n processing cores; the processing cores are all on one chip.
Multi-core CPU chip architecture: a hierarchy of caches (on/off chip); the L1 cache is private to each core (on-chip); the L2 cache is shared (on-chip); the L3 cache or dynamic random access memory (DRAM) is off-chip.
Clock rates for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years. Clock rate increases beyond roughly 5 GHz unfortunately reached a limit due to power limitations / heat. Multi-core CPU chips have quad, six, or n processing cores on one chip and use cache hierarchies.
[11] Distributed & Cloud Computing Book
Example: BlueGene Architecture Evolution
(Figure: BlueGene/P and BlueGene/Q architecture evolution)
Network Topologies
Large-scale HPC systems have special network setups: dedicated I/O nodes; fast interconnects, e.g. Infiniband (IB); different network topologies, e.g. tree, 5D torus, mesh, etc. (these raise challenges in task mappings and communication patterns).
[2] Introduction to High Performance Computing for Scientists and Engineers
Source: IBM
Data Access
P = processor core elements; compute floating point or integer operations: arithmetic units (compute operations) and registers (feed those units with operands).
‘Data access’ levels for applications:
- Registers: accessed without any delay
- L1D = Level 1 Cache – Data (fastest, used normally)
- L2 = Level 2 Cache (fast, used often)
- L3 = Level 3 Cache (still fast, used less often)
- Main memory (slow, but larger in size)
- Storage media like hard disks, tapes, etc. (too slow to be used in direct computing)
The DRAM gap is the large discrepancy between main memory and cache bandwidths
(Figure: memory hierarchy, faster toward registers and caches, cheaper but too slow toward main memory and storage)
[2] Introduction to High Performance Computing for Scientists and Engineers
HPC Relationship to ‘Big Data‘
[12] F. Berman: ‘Maximising the Potential of Research Data’
Large-scale Computing Infrastructures
Large computing systems are often embedded in infrastructures: Grid computing for distributed data storage and processing via middleware. The success of Grid computing was renowned when it was mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs Boson discovery:
Other large-scale distributed infrastructures exist:
- Partnership for Advanced Computing in Europe (PRACE), EU HPC
- Extreme Science and Engineering Discovery Environment (XSEDE), US HPC
‘Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing’.
[13] Grid Computing YouTube Video
Towards new HPC Architectures – DEEP-EST EU Project
(Figure: DEEP-EST modular system sketch: CN nodes with general-purpose CPU and MEM; BN nodes with many-core CPU; DN nodes with general-purpose CPU, FPGA, MEM, and NVRAM; NAM modules with FPGA and NVRAM; a possible application workload is mapped across the modules)
[14] DEEP-EST EU Project
Parallel Programming Basics
Distributed-Memory Computers Reviewed
Processors communicate via Network Interfaces (NI). The NI mediates the connection to a communication network. This setup is rarely used from a programming model view today.
A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly
Modified from [2] Introduction to High Performance Computing for Scientists and Engineers
Programming Model:
Message Passing
Programming with Distributed Memory using MPI
There is no remote memory access on distributed-memory systems; they require ‘sending messages’ back and forth between processes PX.
Many free Message Passing Interface (MPI) libraries are available. Programming is tedious & complicated, but it is the most flexible method.
(Figure: processes P1 to P5 communicating over a network)
Distributed-memory programming enables explicit message passing as communication between processors
MPI is dominant distributed-memory programming standard today (v3.1)
[8] MPI Standard
What is MPI?
A ‘communication library’ abstracting from the low-level network view: it offers 500+ functions to communicate between computing nodes, yet practice reveals that parallel applications often require just ~12 (!) of them. It includes routines for efficient ‘parallel I/O’ (using the underlying hardware).
It supports ‘different ways of communication’: ‘point-to-point communication’ between two computing nodes, and collective functions that involve N computing nodes in useful communication.
Deployment on Supercomputers Installed on (almost) all parallel computers Different languages: C, Fortran, Python, R, etc. Careful: Different versions might be installed
Recall ‘computing nodes’ are independent computing processors (that may also have N cores each) and that are all part of one big parallel computer
Message Passing: Exchanging Data with Send/Receive
Each processor has its own data in its memory that cannot be seen/accessed by other processors.
(Figure: an HPC machine with compute nodes P1 to P6, each processor P with local memory M; point-to-point communications send e.g. DATA: 17 and DATA: 06 from their owners so they arrive as NEW: 17 and NEW: 06 on other nodes)
Collective Functions: Broadcast (one-to-many)
(Figure: DATA: 17 on one processor is distributed to all others, where it arrives as NEW: 17)
Broadcast distributes the same data to many or even all other processors
Collective Functions: Scatter (one-to-many)
Scatter distributes different data to many or even all other processors.
(Figure: DATA: 10, 20, 30 on one processor are distributed; the others receive NEW: 10, NEW: 20, and NEW: 30 respectively)
Collective Functions: Gather (many-to-one)
Gather collects data from many or even all other processors to one specific processor.
(Figure: DATA: 06, 19, 80 on three processors are collected as NEW: 06, NEW: 19, NEW: 80 on one processor)
Collective Functions: Reduce (many-to-one)
Reduce combines collection with computation based on data from many or even all other processors.
Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors.
(Figure: DATA: 17, 06, 19, 80 are combined with +; the global sum NEW: 122 arrives on one processor)
Is MPI yet another network library?
TCP/IP and socket programming libraries are plentifully available, so do we need a dedicated communication & network protocols library? The goal is to simplify parallel programming and let developers focus on their applications.
Selected reasons:
- Designed for performance within large parallel computers (e.g. no security)
- Supports various interconnects between ‘computing nodes’ (hardware)
- Offers various benefits like ‘reliable messages’ or ‘in-order arrivals’
MPI is not designed to handle arbitrary communication in computer networks and is thus very special:
- Not good for clients that constantly establish/close connections again and again (this would have very slow performance in MPI)
- Not good for internet chat clients or Web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)
(MPI) Basic Building Blocks: A main() function
The main() function is automatically started when launching a C program
Normally the ‘return code’ denotes whether the program exit was ok (0) or problematic (-1).
Practice view: resiliency (e.g. automatic restart and error handling) is not part of MPI; therefore this return code is rarely used in practice.
‘standard C programming…’
(MPI) Basic Building Blocks: Variables & Output
Libraries can be used by including C header files; here, for example, the library for screen output.
Two integer variables that are later useful for working with specific data obtained from the MPI library.
Output with printf using the stdio library: ‘Hello World’ and which process is printing out of the summary of all n processes.
‘standard C programming…’
MPI Basic Building Blocks: Header & Init/Finalize
‘standard C programming including MPI library use…’ Libraries can be used by including C header files; here, the MPI library is included.
The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments.
MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place).
MPI Basic Building Blocks: Rank & Size Variables
‘standard C programming including MPI library use…’
The MPI_COMM_WORLD communicator constant denotes the ‘region of communication’, here all processes.
The MPI_Comm_size() function determines the overall number of n processes in the parallel program and stores it in the variable size.
The MPI_Comm_rank() function determines the unique identifier for each process and stores it in the variable rank, with values (0 … n-1).
Compiling & Executing an MPI program
Compilers and linkers need various information about where include files and libraries can be found, e.g. C header files like ‘mpi.h’, or Fortran modules via ‘use MPI’. Compiling is different for each programming language.
Executing the MPI program on 4 processors: normally via batch system allocations (cf. SLURM on the JÖTUNN cluster); manual start-up example:
$> mpirun -np 4 ./hello
This creates 4 processes that produce output in parallel. The order of the outputs can vary because the screen is a ‘serial resource’ for I/O:
hello
hello
hello
hello
(Figure: four processes, each with its own processor P and memory M)
Practice: Our 4 CPU Program alongside many other Programs
[10] LLView Tool
Maybe our program!
MPI Communicators
Each MPI activity specifies the context in which a corresponding function is performed: MPI_COMM_WORLD is the region/context of all processes.
(Sub-)groups of the processes / virtual groups of processes can be created; communications can then be performed easily within these sub-groups, with well-defined processes.
Using communicators wisely in collective functions can reduce the number of affected processors.
[15] LLNL MPI Tutorial
Shared-Memory Computers: Reviewed
Two varieties of shared-memory systems:
1. Unified Memory Access (UMA)
2. Cache-coherent Nonuniform Memory Access (ccNUMA)
A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space
[2] Introduction to High Performance Computing for Scientists and Engineers
Programming model: work on shared address space (‘local access to memory’)
Programming with Shared Memory using OpenMP
OpenMP is a set of compiler directives to ‘mark parallel regions’. Bindings are defined for the C, C++, and Fortran languages. Threads TX are ‘lightweight processes’ that mutually access data. (The shared-memory concept itself is very old, cf. POSIX Threads.)
(Figure: threads T1 to T5 accessing one shared memory)
Shared-memory programming enables immediate access to all data from all processors without explicit communication.
OpenMP is the dominant shared-memory programming standard today (v3).
[7] OpenMP API Specification
What is OpenMP?
OpenMP is a library for specifying ‘parallel regions in serial code’:
- Defined by major computer hardware/software vendors: portability!
- Enables scalability with parallelization constructs, without fixed thread numbers
- Offers a suitable data environment for easier parallel processing of data
- Uses specific environment variables for clever decoupling of code/problem
- Included in standard C compiler distributions (e.g. gcc)

Threads are the central entity in OpenMP:
- Threads enable ‘work-sharing’ and share the address space (where the data resides)
- Threads can be synchronized if needed
- A thread is a lightweight process that shares a common address space with other threads
- Initiating (aka ‘spawning’) n threads is less costly than n processes (e.g. variable space)
Recall ‘computing nodes’ are independent computing processors (that may also have N cores each) and that are all part of one big parallel computer
Threads are lightweight processes that work with data in memory
Parallel and Serial Regions
modified from [2] Introduction to High Performance Computing for Scientists and Engineers
fork(), initiated by the master thread (which always exists), creates a team of threads.
The team of threads works concurrently on shared-memory data in parallel regions.
join() initiates the ‘shutdown’ of the parallel region and terminates the team of threads.
The team of threads may also be put to sleep until the next parallel region begins.
The number of threads can be different in each parallel region.
OpenMP program
Number of Threads & Scalability
The real number of threads is normally not known at compile time. (There are methods for querying it in the program; do not rely on them!) The number is set in scripts and/or an environment variable before executing.
Parallel programming is done without knowing the number of threads: OpenMP programs should always be written in a way that does not assume a specific number of threads. This gives a scalable program.
#include <stdio.h>

int main()
{
#pragma omp parallel
    printf("Hello World\n");
}

compile & execute:

./helloworld.exe
Hello World
Hello World
Hello World
Hello World

(Figure: the master thread becomes T0; team of threads T0, T1, T2, T3)
OpenMP Basic Building Blocks: Library & Sentinel
‘standard fortran programming…’
‘standard C/C++ programming…’
The OpenMP library contains OpenMP API definitions
The Sentinel is a special string that starts an OpenMP compiler directive
Practice view: programming OpenMP in C/C++ and Fortran is slightly different, but provides the same basic concepts (e.g. no end-of-parallel-region marker in C/C++, local variables, etc.).
OpenMP Basic Building Blocks: Unique Thread IDs
‘standard fortran programming…’
‘standard C/C++ programming…’
do_work_package() routine code is now executed in parallel by each thread
BUT also sub-routines of that routine are now executed in parallel
omp_get_thread_num() function provides unique Thread ID (0…n-1)
omp_get_num_threads() function obtains number of active threads in the current parallel region
OpenMP Basic Building Blocks: Private Variables (Fortran)
PRIVATE defines local variables for each thread
Each thread works independently and thus needs space to ‘store’ local results
Practice view: the real parallelization idea is here in the loop: the simple sum of two arrays For each value of i we can compute and store array values independently from each other
Same code executed n times with n threads, BUT tid is unique and thus different for each thread
Traditional HelloWorld Example (C/C++)

#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int nthreads, tid;

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads in parallel region = %d\n", nthreads);
        }
    }
    return 0;
}

Simple parallel program: only the master (tid = 0) prints how many threads exist in the parallel region. nthreads is a shared variable; tid is a local (private) variable.
OpenMP Basic Building Blocks: Loops (do in Fortran, for in C/C++)
FIRSTPRIVATE() copies the initial value of a shared variable to the local variable (a simple initialization; otherwise problems arise in the loop).
The DO/for directive (in front of the usual loop) distributes the loop iterations automatically among the threads as a specifically supported ‘work-sharing’ construct.
Smart programming support by OpenMP: loops are very often part of scientific applications! Less burden for the programmer: no manual definition of local variables (e.g. i is automatically localized).
A local sum exists, but where is the global sum?
OpenMP Basic Building Blocks: Critical Regions
A local sum exists in each of the different threads; we now have n copies of the local variable value for sum.
Race condition in shared memory: the shared variable pi would be set concurrently by the different threads; the value of pi would depend on the exact order in which the threads access pi, and wrong values could be assigned.
Critical regions define a region within a parallel region where at most one thread at a time executes the code (e.g. the sum of the new pi based on pi).
OpenMP Basic Building Blocks: Reduction
Several operations are common in scientific applications: +, *, -, &, |, ^, &&, ||, max, min.
REDUCTION() with the operator + on the variable s enables the following: starting with a local copy of s for each thread, each local copy of s is accumulated separately by each thread during the progress of the parallel region. At the end of the parallel region, the copies are automatically synchronized and accumulated into the resulting master thread variable.
Reduction operations are a smart alternative to manual critical region definitions around operations on variables; the reduction operation automatically localizes the variable.
Many-core GPUs
Use of very many simple cores:
- High-throughput-oriented architecture
- Uses massive parallelism by executing a lot of concurrent threads slowly
- Handles an ever increasing amount of multiple instruction threads
- CPUs, in contrast, typically execute a single long thread as fast as possible
Many-core GPUs are used in large clusters and within massively parallel supercomputers today; this is named General-Purpose Computing on GPUs (GPGPU).
A Graphics Processing Unit (GPU) is great for data parallelism and task parallelism. Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly.
[11] Distributed & Cloud Computing Book
GPU Acceleration
GPU accelerator architecture example (e.g. an NVIDIA card): GPUs can have 128 cores on one single GPU chip; each core can work with eight threads of instructions, so the GPU is able to concurrently execute 128 * 8 = 1024 threads. The interaction between CPU and GPU goes via memory transfers and is thus the major (bandwidth) bottleneck.
E.g. applications that use matrix-vector multiplication.
GPU acceleration means that GPUs accelerate computing due to massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs.
GPUs are designed to compute large numbers of floating point operations in parallel
[11] Distributed & Cloud Computing Book
NVIDIA Fermi GPU Example
[11] Distributed & Cloud Computing Book
Challenges: Domain Decomposition & Load Imbalance
[16] Map Analysis - Understanding Spatial Patterns and Relationships, Book
Modified from [2] Introduction to High Performance Computing for Scientists and Engineers
(Figure: unused resources in an imbalanced domain decomposition)
Load imbalance hampers performance, because some resources are underutilized
(Figure: boundary and halo regions)
Challenges: Ghost/Halo Regions & Stencil Methods
[2] Introduction to High Performance Computing for Scientists and Engineers
(Figure: halo communication volume comparison, 3 * 16 = 48 vs. 4 * 8 = 32 boundary elements)
Stencil-based iterative methods update array elements according to a fixed pattern called a ‘stencil’.
The key to stencil methods is their regular structure, mostly implemented using arrays in codes.
Lecture Bibliography
[1] Wikipedia ‘Supercomputer’, Online: http://en.wikipedia.org/wiki/Supercomputer
[2] Georg Hager & Gerhard Wellein, ‘Introduction to High Performance Computing for Scientists and Engineers’, Chapman & Hall/CRC Computational Science, ISBN 143981192X, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[3] TOP500 Supercomputing Sites, Online: http://www.top500.org/
[4] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/
[5] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/
[6] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html
[7] The OpenMP API specification for parallel programming, Online: http://openmp.org/wp/openmp-specifications/
[8] The MPI Standard, Online: http://www.mpi-forum.org/docs/
[9] Jötunn HPC Cluster, Online: http://ihpc.is/jotunn/
[10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
[11] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
[12] Fran Berman, ‘Maximising the Potential of Research Data’
[13] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw
[14] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
[15] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/
[16] Joseph K. Berry, ‘Map Analysis: Understanding Spatial Patterns and Relationships’, Online: http://www.innovativegis.com/basis/Books/MapAnalysis/Default.htm