Extreme Scale Heterogeneous Computing: Needs for … · 2013-09-03 · Drivers of Modern HPC...
Extreme Scale Heterogeneous Computing: Needs for Improvements?
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Lightning Round Panel at XSCALE '13
High-End Computing (HEC): PetaFlop to ExaFlop

[Figure: projected performance growth: 100 PFlops in 2015, 1 EFlops in 2018]

Expected to have an ExaFlop system in 2019-2022!
XSCALE13-Panel 2
Exascale System Targets

Systems                    | 2010      | 2018-2022                                         | Difference, Today & 2018-2022
---------------------------|-----------|---------------------------------------------------|------------------------------
System peak                | 2 PFlop/s | 1 EFlop/s                                         | O(1,000)
Power                      | 6 MW      | ~20 MW (goal)                                     |
System memory              | 0.3 PB    | 32-64 PB                                          | O(100)
Node performance           | 125 GF    | 1.2 or 15 TF                                      | O(10)-O(100)
Node memory BW             | 25 GB/s   | 2-4 TB/s                                          | O(100)
Node concurrency           | 12        | O(1k) or O(10k)                                   | O(100)-O(1,000)
Total node interconnect BW | 3.5 GB/s  | 200-400 GB/s (1:4 or 1:8 from memory BW)          | O(100)
System size (nodes)        | 18,700    | O(100,000) or O(1M)                               | O(10)-O(100)
Total concurrency          | 225,000   | O(billion) + [O(10) to O(100) for latency hiding] | O(10,000)
Storage capacity           | 15 PB     | 500-1000 PB (>10x system memory is min)           | O(10)-O(100)
IO rates                   | 0.2 TB/s  | 60 TB/s                                           | O(100)
MTTI                       | Days      | O(1 day)                                          | -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra
Broad Goals of Exascale Systems
• Maximizing concurrency
• Maximizing overlap between computation, communication and I/O
• Minimizing data movement
• Minimizing energy requirement
• Achieving fault tolerance
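Minimizing data movement applies within a node as well as across the network. As an illustrative sketch (not from the slides), cache blocking is the classic single-node instance: a tiled matrix multiply reuses each tile while it is cache-resident instead of streaming operands from main memory repeatedly. Sizes below are arbitrary.

```c
#include <string.h>

#define N  32   /* matrix dimension (illustrative) */
#define BS 8    /* tile size; must divide N */

/* Reference: naive triple loop, re-reads columns of B from memory for every row of A. */
void matmul_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i*N + k] * B[k*N + j];
            C[i*N + j] = s;
        }
}

/* Tiled version: operates on BSxBS tiles so each tile of A and B is
   reused while cache-resident, reducing traffic to main memory. */
void matmul_tiled(const double *A, const double *B, double *C) {
    memset(C, 0, N * N * sizeof(double));
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i*N + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}
```

Both versions perform the same arithmetic in the same order; only the memory access pattern changes, which is exactly the point of the "minimize data movement" goal.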
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• Commodity interconnects like InfiniBand, as well as proprietary interconnects from IBM and Cray
• Accelerators/coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: cluster building blocks. Multi-core Processors; High Performance Interconnects (InfiniBand: <1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip). Example systems on the Top500 list: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10).]
System Efficiency (Rmax/Rpeak) in the Top500 List (June 2013)

[Figure: efficiency (%) vs. Top500 rank for system categories IB-No Accelerator, IB-Accelerator (GPU), IB-Accelerator (MIC), GigE, 10GigE, IBM-BG, Cray, and Other]

Efficiency will become more critical as we go to Exascale.
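For reference, the efficiency plotted here is simply Rmax/Rpeak: the achieved LINPACK rate over the theoretical peak. A minimal sketch, with approximate June 2013 figures for Titan used as an illustrative check:

```c
/* Efficiency as plotted in the Top500 chart: achieved Rmax over
   theoretical Rpeak, expressed in percent. */
double efficiency_pct(double rmax, double rpeak) {
    return 100.0 * rmax / rpeak;
}
```

For example, Titan's roughly 17.59 PFlop/s Rmax against roughly 27.11 PFlop/s Rpeak gives an efficiency near 65%.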
Typical Hardware/Software Stack

Application Kernels/Applications
Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, etc.
Middleware (Library or Runtime for Programming Models):
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene)
Multi/Many-core Architectures and Accelerators (NVIDIA and MIC)
Common Issues on Current Generation Multi-Petaflop Architectures/Systems
• Accelerators/coprocessors are connected over PCI-E
• Disjoint distributed memory between host processes and accelerators/coprocessors
  – Data transfer is prone to PCI-E bottlenecks
  – Different communication paths with varying costs
• CCL-Proxy (for MIC) and GPUDirect-RDMA (for GPGPUs) are trying to alleviate the communication bottleneck
  – However, there is still a big difference in communication cost between:
    • Intra-node
    • Host-MIC or Host-GPU
    • MIC-MIC or GPU-GPU (over the network)
• Heterogeneous computing power between
  – Host core and MIC core
  – Host core and GPU core
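The "different communication paths with varying costs" point can be made concrete with a first-order time model, t = latency + size/bandwidth. All latency and bandwidth constants below are illustrative assumptions, not measurements of any particular system:

```c
/* First-order transfer-time model: t = latency + bytes / bandwidth. */
double xfer_time(double latency_s, double bw_bytes_per_s, double bytes) {
    return latency_s + bytes / bw_bytes_per_s;
}

/* Staged GPU-to-GPU path across two nodes: device-to-host copy over
   PCI-E, host-to-host transfer over InfiniBand, then host-to-device
   copy over PCI-E on the remote side. */
double staged_gpu_to_gpu(double bytes) {
    double pcie = xfer_time(10e-6, 6e9, bytes);   /* assumed PCI-E: 10 us, 6 GB/s */
    double ib   = xfer_time(1e-6, 6.8e9, bytes);  /* assumed FDR IB: 1 us, 6.8 GB/s */
    return 2.0 * pcie + ib;
}

/* GPUDirect-RDMA-style path: the NIC reads/writes GPU memory directly,
   eliminating the two host staging copies. */
double direct_gpu_to_gpu(double bytes) {
    return xfer_time(1e-6, 6e9, bytes);  /* bottlenecked by PCI-E to the GPU */
}
```

Under this model the direct path wins at every message size, and the gap is largest for small, latency-dominated messages, which is why eliminating staging copies matters.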
Interplay Across Different Layers

Applications (Algorithms)
Programming Models (?)
Library or Runtime for Programming Models (Algorithms)
Technologies and Integration: Multi-core, Networking, Accelerators
Needs for Improvements
• Tight integration between CPU cores, accelerator/coprocessor cores and networking
  – APUs, Knights Landing, ….
  – ARM-based designs
• Hybrid programming models
  – MPI and PGAS, or MPI+X
  – Example: MVAPICH2-X (hybrid MPI and PGAS (UPC and OpenSHMEM))
• Design of efficient runtimes for programming models
  – Point-to-point communication
  – Collective communication (non-blocking, hierarchical, topology-aware)
  – Remote Memory Access (RMA)
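As a sketch of the hierarchical collectives mentioned above (a serial simulation, not MVAPICH2 code): a two-level reduction first combines values within each node, then combines the per-node partials across node leaders. The processes-per-node value is an assumption for illustration.

```c
#define PPN 4  /* assumed processes per node, for illustration */

/* Two-level (hierarchical) sum reduction, simulated serially:
   phase 1 reduces within each node; phase 2 combines the per-node
   partial results across node leaders. On a real system phase 1
   stays inside shared memory and only phase 2 crosses the network. */
double hierarchical_reduce(const double *vals, int nprocs) {
    double total = 0.0;
    for (int node = 0; node * PPN < nprocs; node++) {
        double node_partial = 0.0;                      /* intra-node phase */
        for (int r = node * PPN; r < (node + 1) * PPN && r < nprocs; r++)
            node_partial += vals[r];
        total += node_partial;                          /* inter-node phase */
    }
    return total;
}
```

The payoff of the real, parallel version is that only one message per node crosses the interconnect, instead of one per process.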
Needs for Improvements (Cont'd)
• Newer algorithms at the application level
  – Exploit hybrid programming models
  – Minimize data movement
  – Maximize overlap of computation and communication
• Energy-aware design
  – At all levels
• Fault-tolerant design
  – At all levels
• Co-design across all layers
  – Hardware (CPU, accelerators, coprocessors and network)
  – Programming models
  – Runtime
  – Applications
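The compute/communication overlap bullet can be sketched with a double-buffering pipeline: while chunk k is consumed, chunk k+1 is produced in a helper thread. The helper here just fills a buffer; in a real runtime it would stand in for a posted non-blocking receive. All sizes are arbitrary.

```c
#include <pthread.h>

#define CHUNK   1024
#define NCHUNKS 8

typedef struct { double *dst; int chunk_id; } comm_arg;

/* Stand-in for a network receive: fills a buffer with chunk data.
   Run in a helper thread so it proceeds while the main thread computes. */
static void *comm_thread(void *p) {
    comm_arg *a = (comm_arg *)p;
    for (int i = 0; i < CHUNK; i++)
        a->dst[i] = (double)(a->chunk_id * CHUNK + i);
    return NULL;
}

/* Double-buffered pipeline: compute on buffer k%2 while the helper
   thread "receives" chunk k+1 into the other buffer. */
double pipelined_sum(void) {
    double buf[2][CHUNK];
    double total = 0.0;
    comm_arg first = { buf[0], 0 };
    comm_thread(&first);                                   /* prologue: fetch chunk 0 */
    for (int k = 0; k < NCHUNKS; k++) {
        pthread_t t;
        int have_next = (k + 1 < NCHUNKS);
        comm_arg next = { buf[(k + 1) % 2], k + 1 };
        if (have_next)
            pthread_create(&t, NULL, comm_thread, &next);  /* overlap starts */
        for (int i = 0; i < CHUNK; i++)                    /* compute on chunk k */
            total += buf[k % 2][i];
        if (have_next)
            pthread_join(t, NULL);                         /* overlap ends */
    }
    return total;
}
```

If computation and transfer per chunk take comparable time, the pipeline hides one behind the other, approaching max(Tcomp, Tcomm) instead of Tcomp + Tcomm.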
Conclusions

[Diagram: the layered stack (Applications (Algorithms); Hybrid Programming Models; Library or Runtime for Programming Models (Algorithms); Technologies: Multi-core, Networking, Accelerators (Hybrid Systems)) evolving from Multi-Petascale to Exascale]

• New algorithms and programming models
• Power constraints & fault-tolerance will lead to exploration of co-design strategies
• Tighter integration of CPU, Accelerators and Networking
• Efficient and Optimized Runtimes