
Energy Profiling and Analysis of the HPC Challenge Benchmarks
Scalable Performance Laboratory, Department of Computer Science, Virginia Tech

Shuaiwen Song, Hung-ching Chang, Rong Ge†, Xizhou Feng†, Dong Li, and Kirk W. Cameron

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

† Also affiliated with Marquette University.

Results I: Power Profiling and Analysis

Energy Analysis of the HPC Challenge Benchmarks

Results II: Detailed Function-level Analysis

The PowerPack 2.0 Framework

Components:
1. Hardware power/energy profiling
2. Software power/energy profiling control
3. Software system power/energy control
4. Data collection/fusion/analysis
5. System under test

Main features:
a) Direct measurements of the power consumption of a system's major components (e.g., CPU, memory, and disk) and/or an entire computing unit.

b) Automatic logging of power profiles and synchronization to application source code (a brief sketch follows this list).

c) Scalable, fast, and accurate.
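
Feature (b) can be illustrated with a small instrumentation sketch. The pp_* names below are placeholders rather than the actual PowerPack 2.0 API; the idea is only that the application records labeled timestamps on a clock shared with the external power meters, so the two logs can be merged offline.

/* Illustrative sketch; pp_* names are placeholders, not the real
 * PowerPack 2.0 API.  The application writes labeled timestamps while
 * the meter daemon records power samples; merging the two streams on
 * the shared time base attributes power to code regions. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static FILE *pp_log;

static void pp_label(const char *what) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);        /* shared wall-clock base */
    fprintf(pp_log, "%ld.%09ld %s\n", (long)ts.tv_sec, ts.tv_nsec, what);
}

int main(void) {
    pp_log = fopen("app_markers.log", "w");
    if (!pp_log)
        return 1;

    pp_label("region_of_interest_begin");
    /* ... code region whose power profile we want to isolate ... */
    pp_label("region_of_interest_end");

    fclose(pp_log);
    return 0;
}

Aligning these markers with the meter samples is what allows a power spike to be attributed to a specific region of the run rather than to the application as a whole.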

HPCC Power Profile of Full Benchmark Run

Each application has a unique power signature. In the figure below, power consumption is separated by major computing components: CPU, memory, disk, and motherboard. These four components capture nearly all of the system's dynamic power usage.

The figure above shows that parallel computation changes the locality of data accesses and impacts the major computing components’ power profiles over the execution of the benchmarks.


Detailed power/energy/performance profiling and analysis of the HPCC global benchmarks, including scalability tests, parallel efficiency, and power-function mapping.

Analysis:
1) Each test in the benchmark suite stresses processor and memory power in proportion to its use of those components. For example, because Global HPL and Star DGEMM have high temporal and spatial locality, they spend little time waiting on data and stress the processor's floating-point execution units intensively, consuming more processor power than the other tests.

2) Changes in processor and memory power profiles correlate with communication-to-computation ratios. Power varies for global tests such as PTRANS, HPL, and MPI_FFT because of their computation and communication phases.

3) Disk power and motherboard power are relatively stable over all tests.

4) Processors consume more power during Global and Star tests since these tests use all processor cores in the computation. Local tests use only one core per node and thus draw less processor power.

Detailed power profiles for four Global HPCC benchmarks across eight computing nodes with 32 cores.

Key Findings

(1) This work identifies power profiles at the system-component level and at the application-function level.

(2) This work reveals the correlation between spatio-temporal locality and energy use for these benchmarks.

(3) This work explores the relationship between scalability and energy use for high-end systems.

About the HPC Challenge Benchmarks

HPC Challenge (HPCC) benchmarks are specifically designed to stress aspects of application and system design ignored by the NAS benchmarks and LINPACK, to aid in system procurements and evaluations. HPCC organizes the benchmarks into four categories; each category represents a type of memory access pattern characterized by the benchmarks' spatial and temporal locality of memory access. We use a classification scheme to separate the performance phases that make up the HPCC benchmark suite, as shown in the table: 1. Local (single processor); 2. Star (embarrassingly parallel); 3. Global (explicit parallel data communications).

Spatio-temporal locality vs. Avg Power Use

HPCC is designed to stress all aspects of a high-performance system, including CPU, memory, disk, and network. We characterized HPCC results based on data locality; a toy sketch of the two locality extremes follows the list below.

• Since lower temporal and spatial locality imply higher average memory access delay times, applications with (low, low) temporal-spatial locality use less power on average.

• Since higher temporal and spatial locality imply lower average memory access delay times, applications with (high, high) temporal-spatial locality use more power on average.

• Mixed temporal and spatial locality implies mixed results that fall between the average power ranges of (high, high) and (low, low) temporal-spatial locality codes.
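
The sketch below is a toy illustration of these locality extremes, not part of the benchmark suite: a sequential reduction (high spatial and temporal locality) keeps the arithmetic units busy, while a random pointer chase over a single-cycle permutation (low locality) stalls the core on memory. Measured with PowerPack, one would expect the first kernel to draw more CPU power and the second to draw less.

/* Toy locality contrast (illustrative only).  Compile with any C99
 * compiler; profile each kernel separately to compare power draw. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)              /* ~16M elements, far larger than cache */

/* High locality: sequential, streaming accumulation. */
static double sum_sequential(const double *a) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Low locality: chase a single-cycle random permutation, so nearly
 * every access misses in cache and the core mostly waits on memory. */
static size_t chase_random(const size_t *next) {
    size_t idx = 0;
    for (size_t i = 0; i < N; i++)
        idx = next[idx];
    return idx;
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    size_t *next = malloc(N * sizeof *next);
    if (!a || !next)
        return 1;

    for (size_t i = 0; i < N; i++) {
        a[i] = 1.0;
        next[i] = i;
    }
    /* Sattolo's algorithm: shuffle into a single N-element cycle so the
     * pointer chase really visits all N slots in random order. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    printf("sequential sum = %f\n", sum_sequential(a));
    printf("random chase ends at index %zu\n", chase_random(next));
    free(a);
    free(next);
    return 0;
}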

[Figure: the PowerPack framework, showing hardware power/energy profiling, software power/energy control, and data collection attached to the HPC cluster under test.]

A snapshot of the HPCC power profile. The full run exercises the suite's seven micro-benchmarks in the following phase order: 1. PTRANS; 2. HPL; 3. Star DGEMM + single DGEMM; 4. Star STREAM; 5. MPI_RandomAccess; 6. Star_RandomAccess; 7. Single_RandomAccess; 8. MPI_FFT, Star_FFT, single FFT, and latency/bandwidth.


Portions of this work have appeared in the following publications:

Shuaiwen Song, Rong Ge, Xizhou Feng, Kirk W. Cameron, "Energy Profiling and Analysis of HPC Challenge Benchmarks," International Journal of High Performance Computing Applications, Vol. 23, No. 3, pp. 265-276 (2009).

Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, Kirk W. Cameron, "PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications," IEEE Transactions on Parallel and Distributed Systems, to appear (2009).

Detailed power-function mapping of MPI_FFT in HPCC.
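
A rough sketch of how such a mapping can be obtained (the power_mark routine below is a placeholder, not the instrumentation actually used for the MPI_FFT measurements): bracket the compute-bound and communication-bound phases with timestamped labels, then align those labels with the recorded power trace.

/* Hypothetical phase-bracketing sketch for a distributed FFT-like code.
 * power_mark() is a placeholder name; in a real PowerPack run the label
 * and timestamp would go into the shared power-profile log. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static void power_mark(const char *label) {
    printf("marker %.6f %s\n", MPI_Wtime(), label);
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1 << 16;      /* doubles exchanged per rank pair */
    double *sendbuf = malloc((size_t)chunk * size * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)chunk * size * sizeof *recvbuf);
    if (!sendbuf || !recvbuf)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Compute-bound phase (stand-in for the local FFT stages):
     * expect higher processor power here. */
    power_mark("compute_begin");
    for (int i = 0; i < chunk * size; i++)
        sendbuf[i] = sin((double)(i + rank));
    power_mark("compute_end");

    /* Communication-bound phase (stand-in for the FFT transpose):
     * expect lower processor power and more memory/NIC activity. */
    power_mark("alltoall_begin");
    MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                 recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    power_mark("alltoall_end");

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The alternation between such compute and communication phases is consistent with the power variation noted for MPI_FFT in the analysis above.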

Energy Profiling and Efficiency Under Strong Scaling and Weak Scaling of HPCC

[Figure panels: Strong Scaling and Weak Scaling.]
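
For reference, the quantities behind these plots can be written down using standard definitions (assumed here, not copied from the poster): energy is the integral of measured power over the run, and strong-scaling energy behavior follows from how total power and speedup trade off.

% Assumed standard definitions; not reproduced from the poster.
% Energy of a run on n nodes, computed from the sampled power trace:
\[ E(n) \;=\; \int_0^{T(n)} P(t)\,dt \;\approx\; \sum_{k} P_k\,\Delta t \]
% Strong scaling (fixed problem size): speedup and parallel efficiency
\[ S(n) \;=\; \frac{T(1)}{T(n)}, \qquad \mathit{Eff}(n) \;=\; \frac{S(n)}{n} \]
% With \bar{P}(n) the average total system power of the n-node run,
% the energy relative to the one-node run is
\[ \frac{E(n)}{E(1)} \;=\; \frac{\bar{P}(n)}{\bar{P}(1)} \cdot \frac{T(n)}{T(1)}
   \;=\; \frac{\bar{P}(n)}{\bar{P}(1)} \cdot \frac{1}{S(n)} \]
% so under strong scaling, energy grows whenever total power rises faster
% than the speedup; under weak scaling, per-node work is fixed and total
% energy roughly tracks n times the single-node energy as long as
% parallel efficiency stays high.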

Conclusions:

• Each application has a unique power profile characterized by the power distribution among major system components.

• The power profiles of the HPCC benchmark suite reveal power boundaries for real applications.

• Energy efficiency is a critical issue in high-performance computing that requires further study, since the interactions between hardware and application affect power usage dramatically.

System G and PowerPack 2.0

System G (Green): System G provides a research platform for the development of high-performance software tools and applications with extreme efficiency at scale.

System G Stats
• 325 Mac Pro computer nodes, each with two quad-core 2.8 GHz Intel Xeon processors.

• Each node has 8 GB of random access memory (RAM); each core has 6 MB of cache.

• Mellanox 40 Gb/s end-to-end InfiniBand adapters and switches.

• LINPACK result: 22.8 TFLOPS (trillion floating-point operations per second).

• Over 10,000 power and thermal sensors.

• Variable power modes: DVFS control, fan-speed control, concurrency throttling, and dynamic system temperature control (a generic DVFS example follows this list).

• Intelligent power distribution unit: Dominion PX.
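
As a generic illustration only (not System G's specific control stack), DVFS on a Linux node can be exercised through the cpufreq sysfs interface:

/* Generic Linux DVFS example; requires root and a cpufreq driver that
 * exposes the "userspace" governor and scaling_setspeed (e.g.
 * acpi-cpufreq).  Not specific to System G's power-management tools. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void) {
    const char *base = "/sys/devices/system/cpu/cpu0/cpufreq";
    char path[256];

    /* Let user space choose the frequency on cpu0. */
    snprintf(path, sizeof path, "%s/scaling_governor", base);
    if (write_sysfs(path, "userspace") != 0)
        return 1;

    /* Request 2.0 GHz (value in kHz); the driver clamps to the nearest
     * supported P-state. */
    snprintf(path, sizeof path, "%s/scaling_setspeed", base);
    return write_sysfs(path, "2000000") != 0 ? 1 : 0;
}

Drivers that do not expose scaling_setspeed (for example intel_pstate) require a different interface, so treat this purely as a sketch of the mechanism.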

What makes System G so Green?


PowerPack Framework

Amplified phase at the function level.

HPC Challenge Benchmark Performance Characteristics

Benchmark        | Spatial Locality | Temporal Locality | Mode               | Description
HPL              | High             | High              | Global             | Stresses FP performance
DGEMM            | High             | High              | Star+Local         | Stresses FP performance
STREAM           | High             | Low               | Star               | Measures memory bandwidth
PTRANS           | High             | Low               | Global             | Measures data transfer
FFT              | Low              | High              | Global+Star+Local  | Measures FP + data transfer
RandomAccess     | Low              | Low               | Global+Star+Local  | Random in-memory updates
Comm Latency/BW  | Low              | Low               | Global             | Measures latency + bandwidth

The authors would like to thank the National Science Foundation for support of this work under grants CCF #0848670, CNS #0720750, and CNS #0709025.
