A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications


A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications

Boyana Norris, Argonne National Laboratory

Van Bui, Lois Curfman McInnes, Li Li, Argonne National Laboratory

Oscar Hernandez, Barbara Chapman, University of Houston

Kevin Huck, University of Oregon

CBHPC, Karlsruhe, Germany, October 17, 2008

Outline

Motivation

Performance/Power Models

Component Infrastructure

Experiments

Conclusions and Future Work

Acknowledgements


Component-Based Software Engineering

Functional unit with well-defined interfaces and dependencies
Components interact through ports (see the sketch below)
Benefits: software reuse, management of complex software, code generation, available “services”
Drawbacks: a more restrictive software engineering process, the need for a runtime framework
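To make the ports idea concrete, here is a minimal Python sketch of the provides/uses port pattern found in component models such as CCA; all names are hypothetical illustrations, not the actual CCA/Babel API.

```python
# Minimal sketch of the provides/uses port pattern (hypothetical names,
# not the actual CCA/Babel API).

class SolverPort:
    """A port: a well-defined interface a component provides or uses."""
    def solve(self, rhs):
        raise NotImplementedError

class LinearSolverComponent(SolverPort):
    """Provides SolverPort; the implementation behind it can be swapped."""
    def solve(self, rhs):
        return [x / 2.0 for x in rhs]   # stand-in for a real solver

class SimulationComponent:
    """Uses SolverPort; it depends only on the interface, not the impl."""
    def __init__(self):
        self.solver = None              # connected later by the framework

    def run(self):
        return self.solver.solve([1.0, 2.0, 3.0])

# A component framework would do this wiring; here it is done by hand.
app = SimulationComponent()
app.solver = LinearSolverComponent()
print(app.run())                        # [0.5, 1.0, 1.5]
```

Because SimulationComponent sees only the port, a framework can substitute solver implementations at run time, which is exactly what the control infrastructure later in the talk exploits.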


Motivation

CBSE is increasingly used in HPC
Power is increasing in importance
There is a need for simpler processes for performance/power measurement and analysis
― Performance tools can be applied at the component abstraction layer
― Opportunities for automation


Power vs. Energy

Power: the rate at which a system performs work

Power = Work / ΔTime

Energy: the total work performed over a period of time

Energy = Power × ΔTime
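A quick worked illustration of the distinction (all numbers made up):

```python
# Illustrative numbers only: a run drawing constant average power.
power_watts = 250.0                  # average power during the run
run_seconds = 120.0                  # wall-clock duration

energy_joules = power_watts * run_seconds    # Energy = Power * dT
print(energy_joules, "J =", energy_joules / 3600.0, "Wh")   # 30000 J

# A faster run at higher power can still consume less energy if the
# time savings outweigh the power increase:
print(300.0 * 90.0, "J at 300 W for 90 s")   # 27000 J < 30000 J
```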


Power Trends


Cameron, K. W., Ge, R., and Feng, X. 2005. High-Performance, Power-Aware Distributed Computing for Scientific Applications. Computer 38, 11 (Nov. 2005), 40-47.

Power Reduction Techniques

Circuit- and logic-level techniques
Low-power interconnect
Low-power memories and memory hierarchy
Low-power processor architecture adaptations
Dynamic voltage scaling (see the sketch below)
Resource hibernation
Compiler-level power management
Application-level power management
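As one concrete example, dynamic voltage (and frequency) scaling is exposed to user space on Linux through the cpufreq sysfs interface. A minimal sketch, assuming a cpufreq-enabled kernel, a driver that exports these files, and root privileges:

```python
# Minimal DVFS sketch using the Linux cpufreq sysfs interface.
# Assumes a cpufreq-enabled kernel and root privileges; the exact files
# exported depend on the cpufreq driver in use.
CPU0 = "/sys/devices/system/cpu/cpu0/cpufreq"

def read(name):
    with open("%s/%s" % (CPU0, name)) as f:
        return f.read().strip()

def set_governor(governor):
    # e.g. "performance", "powersave", "ondemand"
    with open("%s/scaling_governor" % CPU0, "w") as f:
        f.write(governor)

print("governor:", read("scaling_governor"))
print("current frequency (kHz):", read("scaling_cur_freq"))
set_governor("powersave")   # trade clock frequency for lower power
```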


Goals and Approach

Provide a component-based system that:
― Facilitates performance/power measurement and analysis
― Computes high-level performance metrics
― Integrates existing tools into a uniform interface
― End goal: static and dynamic optimizations based on offline/online analyses


System Diagram

[Diagram: analysis infrastructure on one side, control infrastructure on the other. Runs of an instrumented component application populate performance/power databases (persistent and runtime). Machine learning plus interactive analysis and model building (substitution, assertion, database) feed a control system, which applies parameter changes and component substitution, drawing on a substitution set of components A, B, and C, to a CQoS-enabled component application.]

Performance Model I

FLP Inefficiency – PD: problem-size-dependent variant
FLP Inefficiency – PI: problem-size-independent variant


Metric                   Definition
Global Stalls            stall_cycles / total_cycles
% FLP Stalls             FLP_stalls / stall_cycles
FLP Inefficiency – PD    FLP_OPS * (stalls / cycles)
FLP Inefficiency – PI    (FLP_OPS / retired_inst) * (stalls / cycles)
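A minimal sketch of how these metrics can be computed from hardware counter data; the counter values below are illustrative stand-ins for the platform-specific events that TAU/PAPI would actually record.

```python
# Illustrative counter values; in practice these come from TAU/PAPI.
c = {
    "total_cycles": 4.0e9,
    "stall_cycles": 1.2e9,
    "flp_stalls":   3.0e8,
    "flp_ops":      9.0e8,
    "retired_inst": 2.5e9,
}

global_stalls  = c["stall_cycles"] / c["total_cycles"]     # Global Stalls
pct_flp_stalls = c["flp_stalls"] / c["stall_cycles"]       # % FLP Stalls
flp_ineff_pd   = c["flp_ops"] * global_stalls              # PD variant
flp_ineff_pi   = (c["flp_ops"] / c["retired_inst"]) * global_stalls  # PI

print(global_stalls, pct_flp_stalls, flp_ineff_pd, flp_ineff_pi)
```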

Performance Model II

Core logic stalls = L1D_register_stalls + branch_misprediction + instruction_miss + stack_engine_stalls + floating_point_stalls + pipeline_inter_register_dependency + processor_frontend_flush

Memory stalls = L1_hits * L1_latency + L2_hits * L2_latency + L3_hits * L3_latency + local_mem_access * local_mem_latency + remote_mem_access * remote_mem_latency + TLB_miss * TLB_miss_penalty
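The memory-stall term, written as code; the per-level latencies are illustrative placeholders for the machine-specific values that would be used in practice.

```python
# Memory stall model: hits at each level of the hierarchy, weighted by
# that level's access latency. Latencies (cycles) are illustrative only.
LAT = {"L1": 1, "L2": 6, "L3": 14,
       "local_mem": 200, "remote_mem": 350, "tlb_miss": 25}

def memory_stalls(hits):
    """hits: per-event counts, e.g. PAPI-measured cache/memory hits."""
    return sum(hits[level] * LAT[level] for level in LAT)

print(memory_stalls({"L1": 1.0e9, "L2": 2.0e8, "L3": 5.0e7,
                     "local_mem": 1.0e7, "remote_mem": 2.0e6,
                     "tlb_miss": 3.0e6}))
```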


Power Model


Based on on-die components
Leverages performance hardware counters (see the sketch below)
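In the spirit of the counter-based models in the references (e.g., Isci and Martonosi 2003), per-component power is estimated as an activity-weighted maximum plus a static floor. The component list, weights, and wattages below are illustrative placeholders, not the fitted model from the talk.

```python
# Counter-driven power model sketch (after Isci & Martonosi):
#   P = P_static + sum_i access_rate_i * weight_i * P_max_i
# All coefficients below are illustrative placeholders.
P_STATIC = 20.0                      # idle/static power floor (W)
COMPONENTS = {                       # name: (max power W, activity weight)
    "fpu": (12.0, 1.0),
    "alu": (8.0, 1.0),
    "l2_cache": (6.0, 0.8),
    "front_side_bus": (5.0, 0.9),
}

def estimate_power(access_rates):
    """access_rates: per-component accesses per cycle from HW counters."""
    p = P_STATIC
    for name, (p_max, weight) in COMPONENTS.items():
        p += access_rates.get(name, 0.0) * weight * p_max
    return p

print(estimate_power({"fpu": 0.4, "alu": 0.7,
                      "l2_cache": 0.1, "front_side_bus": 0.2}))
```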

Performance Measurement and Analysis System

Components:
― TAU: performance measurement
  http://www.cs.uoregon.edu/research/tau/home.php
― Performance database component(s)
― PerfExplorer: performance and power analysis
  http://www.cs.uoregon.edu/research/tau/docs/perfexplorer/


[Diagram: a component application is instrumented by the TAU component; measurements flow into the database components and on to the PerfExplorer component, whose analysis results feed runtime optimization, compiler feedback, and user/tool analysis.]

PerfExplorer Component

Loads a Python analysis script (see the sketch below)
Performance and power analysis
Data mining, inference rules, comparison of different experimental runs

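For illustration, a minimal script of the kind the component loads. It follows the Jython "glue" scripting interface documented at the PerfExplorer URL above, but the database, trial, and counter names are hypothetical, and the class names should be verified against the installed release.

```python
# Minimal PerfExplorer analysis script sketch (Jython, run inside
# PerfExplorer). Class names follow the PerfExplorer 2 "glue" scripting
# interface; verify against the installed release. All database, trial,
# and counter names below are hypothetical.
from glue import Utilities, TrialMeanResult, DeriveMetricOperation

Utilities.setSession("perfdmf")               # connect to the PerfDMF DB
trial = Utilities.getTrial("GenIDLEST", "opt-levels", "O2")
result = TrialMeanResult(trial)

# Derive Global Stalls = stall_cycles / total_cycles from raw counters.
op = DeriveMetricOperation(result, "BACK_END_BUBBLE_ALL", "CPU_CYCLES",
                           DeriveMetricOperation.DIVIDE)
derived = op.processData().get(0)
print "derived metrics:", derived.getMetrics()
```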

Study I: Performance-Power Trade-offs


Experiment: effect of compiler optimization levels on performance and power

Experimental details:
― Machine: SGI Altix 300 (Linux/ccNUMA)
― MPI processes: 16
― Compiler: OpenUH
― Code: GenIDLEST
― Optimization levels: -O0, -O1, -O2, -O3
― Performance tools: TAU, PerfExplorer, and PAPI


Results


Aggressive optimizations lead to higher power; IPC correlates with power dissipation

Aggressive optimizations lead to lower energy; operation count correlates with energy consumption

Performance/Power Study With PETSc Codes

PETSc: Portable, Extensible Toolkit for Scientific Computation
― http://www.mcs.anl.gov/petsc/

Experimental details:
― Machine: SGI Altix 3600
― Compiler: GCC
― MPI processes: 32
― Application: 2-D simulation of cavity flow
  Krylov subspace linear solvers: FGMRES, GMRES, BiCGS (see the sketch below)
  Preconditioner: block Jacobi
  Problem size: 16x16 per processor (weak scaling)
― Performance tools: TAU, PerfExplorer, PAPI
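These solver and preconditioner choices are runtime options in PETSc (-ksp_type fgmres|gmres|bcgs, -pc_type bjacobi). A minimal sketch using the petsc4py Python bindings, purely for illustration; the tiny diagonal system below stands in for the cavity-flow Jacobian.

```python
# Selecting the solvers compared in this study, via petsc4py.
# Illustration only: a trivial diagonal system stands in for the
# cavity-flow Jacobian. Note that BiCGS is spelled "bcgs" in PETSc.
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

for ksp_type in ("fgmres", "gmres", "bcgs"):
    A = PETSc.Mat().createAIJ([16, 16])    # 16x16 sparse matrix
    A.setUp()
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)              # A = 2*I, trivial SPD system
    A.assemble()
    b = A.createVecLeft(); b.set(1.0)
    x = A.createVecRight()

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType(ksp_type)                  # -ksp_type fgmres|gmres|bcgs
    ksp.getPC().setType("bjacobi")         # -pc_type bjacobi
    ksp.solve(b, x)
    print(ksp_type, "iterations:", ksp.getIterationNumber())
```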


Inefficiency


― Bottlenecks in the methods used to solve the linear system
― A bottleneck also in the preconditioner

Results

FGMRES has good performance initially
― but is not very power efficient

BCGS is optimal for both performance and power efficiency


Conclusions

There is little or no hardware or software support for detailed power measurement and analysis on modern systems

More integrated toolsets are needed that support both performance and power measurement, analysis, and optimization

Combining such tools with component-based software engineering can improve the efficiency and effectiveness of the tuning process


Future Directions

Integration of the components into a framework
Dynamic selection of algorithms and parameters based on offline/online analyses
Compiler-based performance/power cost modeling
Continued performance and power analysis of PETSc-based codes
Extension of the performance and power models to more modern architectures


References

Jarp, S. A Methodology for Using the Itanium-2 Performance Counters for Bottleneck Analysis. Technical report, HP Labs, August 2002.

Bircher, W. L. and John, L. K. Complete System Power Estimation: A Trickle-Down Approach Based on Performance Events. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 158-168, 2007.

Isci, C. and Martonosi, M. Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), December 2003.

Huck, K., Hernandez, O., Bui, V., Chandrasekaran, S., Chapman, B., Malony, A. D., McInnes, L. C., and Norris, B. Capturing Performance Knowledge for Automated Analysis. In Proceedings of Supercomputing (SC'08), 2008. http://www2.cs.uh.edu/~vtbui/sc.pdf


Acknowledgments

Professors/advisors: Boyana Norris, Lois Curfman McInnes, Barbara Chapman, Allen Malony, Danesh Tafti

Students: Oscar Hernandez, Kevin Huck, Sunita Chandrasekaran, Li Li

SiCortex: Lawrence Stuart and Dan Jackson
MCS Division, Argonne National Laboratory
NSF, DOE, NCSA, NASA
