Extreme Scale Heterogeneous Computing: Needs for … · 2013-09-03 · Drivers of Modern HPC...
Extreme Scale Heterogeneous Computing: Needs for Improvements?
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Lightning Round Panel at XSCALE '13
High-End Computing (HEC): PetaFlop to ExaFlop

[Figure: projected performance growth: 100 PFlops in 2015, 1 EFlops in 2018]

Expected to have an ExaFlop system in 2019-2022!
XSCALE13-Panel 2
Exascale System Targets

Systems                    | 2010      | 2018-2022                                         | Difference, Today & 2018-2022
---------------------------|-----------|---------------------------------------------------|------------------------------
System peak                | 2 PFlop/s | 1 EFlop/s                                         | O(1,000)
Power                      | 6 MW      | ~20 MW (goal)                                     |
System memory              | 0.3 PB    | 32-64 PB                                          | O(100)
Node performance           | 125 GF    | 1.2 or 15 TF                                      | O(10)-O(100)
Node memory BW             | 25 GB/s   | 2-4 TB/s                                          | O(100)
Node concurrency           | 12        | O(1k) or O(10k)                                   | O(100)-O(1,000)
Total node interconnect BW | 3.5 GB/s  | 200-400 GB/s (1:4 or 1:8 from memory BW)          | O(100)
System size (nodes)        | 18,700    | O(100,000) or O(1M)                               | O(10)-O(100)
Total concurrency          | 225,000   | O(billion) + [O(10) to O(100) for latency hiding] | O(10,000)
Storage capacity           | 15 PB     | 500-1000 PB (>10x system memory is min)           | O(10)-O(100)
IO rates                   | 0.2 TB/s  | 60 TB/s                                           | O(100)
MTTI                       | Days      | O(1 day)                                          | -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra
Broad Goals of Exascale Systems
• Maximizing concurrency
• Maximizing overlap between computation, communication and I/O
• Minimizing data movement
• Minimizing energy requirement
• Achieving fault tolerance
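Minimizing data movement applies within a node as well as across the network. As an illustrative sketch (not from the slides), cache blocking is the classic single-node instance: a tiled matrix multiply reuses each tile while it is cache-resident instead of streaming operands from main memory repeatedly. Sizes below are arbitrary.

```c
#include <string.h>

#define N  32   /* matrix dimension (illustrative) */
#define BS 8    /* tile size; must divide N */

/* Reference: naive triple loop, re-reads columns of B from memory for every row of A. */
void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i*N + k] * B[k*N + j];
            C[i*N + j] = s;
        }
}

/* Tiled version: operates on BSxBS tiles so each tile of A and B is
   reused while cache-resident, reducing traffic to main memory. */
void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, N * N * sizeof(double));
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i*N + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}
```

Both versions perform the same arithmetic in the same order; only the memory access pattern changes, which is exactly the point of the "minimize data movement" goal.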
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• Commodity interconnects like InfiniBand, as well as proprietary interconnects from IBM and Cray
• Accelerators/coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: cluster building blocks. Multi-core Processors; High Performance Interconnects (InfiniBand: <1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems on the Top500 list: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10).]
System Efficiency (Rmax/Rpeak) in the Top500 List (June 2013)

[Figure: efficiency (%) vs. Top500 rank for system categories IB-No Accelerator, IB-Accelerator (GPU), IB-Accelerator (MIC), GigE, 10GigE, IBM-BG, Cray, and Other]

Efficiency will become more critical as we go to Exascale.
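For reference, the efficiency plotted here is simply Rmax/Rpeak: the achieved LINPACK rate over the theoretical peak. A minimal sketch, with approximate June 2013 figures for Titan used as an illustrative check:

```c
/* Efficiency as plotted in the Top500 chart: achieved Rmax over
   theoretical Rpeak, expressed in percent. */
double efficiency_pct(double rmax, double rpeak) {
    return 100.0 * rmax / rpeak;
}
```

For example, Titan's roughly 17.59 PFlop/s Rmax against roughly 27.11 PFlop/s Rpeak gives an efficiency near 65%.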
Typical Hardware/Software Stack

Application Kernels/Applications
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, etc.
Middleware (Library or Runtime for Programming Models):
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene)
Multi/Many-core Architectures and Accelerators (NVIDIA and MIC)
Common Issues on Current Generation Multi-Petaflop Architectures/Systems
• Accelerators/coprocessors are connected over PCI-E
• Disjoint distributed memory between host processes and accelerators/coprocessors
  – Data transfer is prone to PCI-E bottlenecks
  – Different communication paths with varying costs
• CCL-Proxy (for MIC) and GPUDirect-RDMA (for GPGPUs) are trying to alleviate the communication bottleneck
  – However, there is still a big difference in communication cost between:
    • Intra-node
    • Host-MIC or Host-GPU
    • MIC-MIC or GPU-GPU (over the network)
• Heterogeneous computing power between
  – Host core and MIC core
  – Host core and GPU core
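The "different communication paths with varying costs" point can be made concrete with a first-order time model, t = latency + size/bandwidth. All latency and bandwidth constants below are illustrative assumptions, not measurements of any particular system:

```c
/* First-order transfer-time model: t = latency + bytes / bandwidth. */
double xfer_time(double latency_s, double bw_bytes_per_s, double bytes) {
    return latency_s + bytes / bw_bytes_per_s;
}

/* Staged GPU-to-GPU path across two nodes: device-to-host copy over
   PCI-E, host-to-host transfer over InfiniBand, then host-to-device
   copy over PCI-E on the remote side. */
double staged_gpu_to_gpu(double bytes) {
    double pcie = xfer_time(10e-6, 6e9, bytes);   /* assumed PCI-E: 10 us, 6 GB/s */
    double ib   = xfer_time(1e-6, 6.8e9, bytes);  /* assumed FDR IB: 1 us, 6.8 GB/s */
    return 2.0 * pcie + ib;
}

/* GPUDirect-RDMA-style path: the NIC reads/writes GPU memory directly,
   eliminating the two host staging copies. */
double direct_gpu_to_gpu(double bytes) {
    return xfer_time(1e-6, 6e9, bytes);  /* bottlenecked by PCI-E to the GPU */
}
```

Under this model the direct path wins at every message size, and the gap is largest for small, latency-dominated messages, which is why eliminating staging copies matters.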
Interplay Across Different Layers

Applications (Algorithms)
Programming Models (?)
Library or Runtime for Programming Models (Algorithms)
Technologies and Integration: Multi-core, Networking, Accelerators
Needs for Improvements
• Tight integration between CPU cores, accelerator/coprocessor cores and networking
  – APUs, Knights Landing, ….
  – ARM-based designs
• Hybrid programming models
  – MPI and PGAS, or MPI+X
  – Example: MVAPICH2-X (hybrid MPI and PGAS (UPC and OpenSHMEM))
• Design of efficient runtimes for programming models
  – Point-to-point communication
  – Collective communication (non-blocking, hierarchical, topology-aware)
  – Remote Memory Access (RMA)
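As a sketch of the hierarchical collectives mentioned above (a serial simulation, not MVAPICH2 code): a two-level reduction first combines values within each node, then combines the per-node partials across node leaders. The processes-per-node value is an assumption for illustration.

```c
#define PPN 4  /* assumed processes per node, for illustration */

/* Two-level (hierarchical) sum reduction, simulated serially:
   phase 1 reduces within each node; phase 2 combines the per-node
   partial results across node leaders. On a real system phase 1
   stays inside shared memory and only phase 2 crosses the network. */
double hierarchical_reduce(const double *vals, int nprocs) {
    double total = 0.0;
    for (int node = 0; node * PPN < nprocs; node++) {
        double node_partial = 0.0;                      /* intra-node phase */
        for (int r = node * PPN; r < (node + 1) * PPN && r < nprocs; r++)
            node_partial += vals[r];
        total += node_partial;                          /* inter-node phase */
    }
    return total;
}
```

The payoff of the real, parallel version is that only one message per node crosses the interconnect, instead of one per process.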
Needs for Improvements (Cont'd)
• Newer algorithms at the application level
  – Exploit hybrid programming models
  – Minimize data movement
  – Maximize overlap of computation and communication
• Energy-aware design
  – At all levels
• Fault-tolerant design
  – At all levels
• Co-design across all layers
  – Hardware (CPU, accelerators, coprocessors and network)
  – Programming models
  – Runtime
  – Applications
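The compute/communication overlap bullet can be sketched with a double-buffering pipeline: while chunk k is consumed, chunk k+1 is produced in a helper thread. The helper here just fills a buffer; in a real runtime it would stand in for a posted non-blocking receive. All sizes are arbitrary.

```c
#include <pthread.h>

#define CHUNK   1024
#define NCHUNKS 8

typedef struct { double *dst; int chunk_id; } comm_arg;

/* Stand-in for a network receive: fills a buffer with chunk data.
   Run in a helper thread so it proceeds while the main thread computes. */
static void *comm_thread(void *p) {
    comm_arg *a = (comm_arg *)p;
    for (int i = 0; i < CHUNK; i++)
        a->dst[i] = (double)(a->chunk_id * CHUNK + i);
    return NULL;
}

/* Double-buffered pipeline: compute on buffer k%2 while the helper
   thread "receives" chunk k+1 into the other buffer. */
double pipelined_sum(void) {
    double buf[2][CHUNK];
    double total = 0.0;
    comm_arg first = { buf[0], 0 };
    comm_thread(&first);                                   /* prologue: fetch chunk 0 */
    for (int k = 0; k < NCHUNKS; k++) {
        pthread_t t;
        int have_next = (k + 1 < NCHUNKS);
        comm_arg next = { buf[(k + 1) % 2], k + 1 };
        if (have_next)
            pthread_create(&t, NULL, comm_thread, &next);  /* overlap starts */
        for (int i = 0; i < CHUNK; i++)                    /* compute on chunk k */
            total += buf[k % 2][i];
        if (have_next)
            pthread_join(t, NULL);                         /* overlap ends */
    }
    return total;
}
```

If computation and transfer per chunk take comparable time, the pipeline hides one behind the other, approaching max(Tcomp, Tcomm) instead of Tcomp + Tcomm.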
Conclusions

[Diagram: the layered stack (Applications (Algorithms); Hybrid Programming Models; Library or Runtime for Programming Models (Algorithms); Technologies: Multi-core, Networking, Accelerators (Hybrid Systems)) evolving from Multi-Petascale to Exascale]

• New algorithms and programming models
• Power constraints & fault-tolerance will lead to exploration of co-design strategies
• Tighter integration of CPU, Accelerators and Networking
• Efficient and Optimized Runtimes