A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications


A Component Infrastructure for Performance and Power Modeling of Parallel Scientific Applications

Boyana Norris, Argonne National Laboratory

Van Bui, Lois Curfman McInnes, Li Li, Argonne National Laboratory

Oscar Hernandez, Barbara Chapman, University of Houston

Kevin Huck, University of Oregon

CBHPC, Karlsruhe, Germany, October 17, 2008

Outline

Motivation

Performance/Power Models

Component Infrastructure

Experiments

Conclusions and Future Work

Acknowledgements


Component-Based Software Engineering

Functional unit with well-defined interfaces and dependencies
Components interact through ports (see the sketch below)
Benefits: software reuse, management of complex software, code generation, available “services”
Drawbacks: a more restrictive software engineering process, the need for a runtime framework
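To make the ports idea concrete, here is a minimal Python sketch of the provides/uses port pattern found in component models such as CCA; all names are hypothetical illustrations, not the actual CCA/Babel API.

```python
# Minimal sketch of the provides/uses port pattern (hypothetical names,
# not the actual CCA/Babel API).

class SolverPort:
    """A port: a well-defined interface a component provides or uses."""
    def solve(self, rhs):
        raise NotImplementedError

class LinearSolverComponent(SolverPort):
    """Provides SolverPort; the implementation behind it can be swapped."""
    def solve(self, rhs):
        return [x / 2.0 for x in rhs]   # stand-in for a real solver

class SimulationComponent:
    """Uses SolverPort; it depends only on the interface, not the impl."""
    def __init__(self):
        self.solver = None              # connected later by the framework

    def run(self):
        return self.solver.solve([1.0, 2.0, 3.0])

# A component framework would do this wiring; here it is done by hand.
app = SimulationComponent()
app.solver = LinearSolverComponent()
print(app.run())                        # [0.5, 1.0, 1.5]
```

Because SimulationComponent sees only the port, a framework can substitute solver implementations at run time, which is exactly what the control infrastructure later in the talk exploits.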


Motivation

CBSE is increasingly used in HPC
Power is increasing in importance
There is a need for simpler processes for performance/power measurement and analysis
― Performance tools can be applied at the component abstraction layer
― Opportunities for automation


Power vs. Energy

Power: the rate at which a system performs work

Power = Work / ΔTime

Energy: the total work performed over a period of time

Energy = Power × ΔTime
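A quick worked illustration of the distinction (all numbers made up):

```python
# Illustrative numbers only: a run drawing constant average power.
power_watts = 250.0                  # average power during the run
run_seconds = 120.0                  # wall-clock duration

energy_joules = power_watts * run_seconds    # Energy = Power * dT
print(energy_joules, "J =", energy_joules / 3600.0, "Wh")   # 30000 J

# A faster run at higher power can still consume less energy if the
# time savings outweigh the power increase:
print(300.0 * 90.0, "J at 300 W for 90 s")   # 27000 J < 30000 J
```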


Power Trends


Cameron, K. W., Ge, R., and Feng, X. 2005. High-Performance, Power-Aware Distributed Computing for Scientific Applications. Computer 38, 11 (Nov. 2005), 40-47.

Power Reduction Techniques

Circuit- and logic-level techniques
Low-power interconnect
Low-power memories and memory hierarchy
Low-power processor architecture adaptations
Dynamic voltage scaling (see the sketch below)
Resource hibernation
Compiler-level power management
Application-level power management
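As one concrete example, dynamic voltage (and frequency) scaling is exposed to user space on Linux through the cpufreq sysfs interface. A minimal sketch, assuming a cpufreq-enabled kernel, a driver that exports these files, and root privileges:

```python
# Minimal DVFS sketch using the Linux cpufreq sysfs interface.
# Assumes a cpufreq-enabled kernel and root privileges; the exact files
# exported depend on the cpufreq driver in use.
CPU0 = "/sys/devices/system/cpu/cpu0/cpufreq"

def read(name):
    with open("%s/%s" % (CPU0, name)) as f:
        return f.read().strip()

def set_governor(governor):
    # e.g. "performance", "powersave", "ondemand"
    with open("%s/scaling_governor" % CPU0, "w") as f:
        f.write(governor)

print("governor:", read("scaling_governor"))
print("current frequency (kHz):", read("scaling_cur_freq"))
set_governor("powersave")   # trade clock frequency for lower power
```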


Goals and Approach

Provide a component-based system that:
― Facilitates performance/power measurement and analysis
― Computes high-level performance metrics
― Integrates existing tools into a uniform interface
― End goal: static and dynamic optimizations based on offline/online analyses


System Diagram

[Diagram: analysis infrastructure on one side, control infrastructure on the other. Runs of an instrumented component application populate performance/power databases (persistent and runtime). Machine learning plus interactive analysis and model building (substitution, assertion, database) feed a control system, which applies parameter changes and component substitution, drawing on a substitution set of components A, B, and C, to a CQoS-enabled component application.]

Performance Model I

FLP Inefficiency – PD: problem-size-dependent variant
FLP Inefficiency – PI: problem-size-independent variant


Metric                   Definition
Global Stalls            stall_cycles / total_cycles
% FLP Stalls             FLP_stalls / stall_cycles
FLP Inefficiency – PD    FLP_OPS * (stalls / cycles)
FLP Inefficiency – PI    (FLP_OPS / retired_inst) * (stalls / cycles)
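A minimal sketch of how these metrics can be computed from hardware counter data; the counter values below are illustrative stand-ins for the platform-specific events that TAU/PAPI would actually record.

```python
# Illustrative counter values; in practice these come from TAU/PAPI.
c = {
    "total_cycles": 4.0e9,
    "stall_cycles": 1.2e9,
    "flp_stalls":   3.0e8,
    "flp_ops":      9.0e8,
    "retired_inst": 2.5e9,
}

global_stalls  = c["stall_cycles"] / c["total_cycles"]     # Global Stalls
pct_flp_stalls = c["flp_stalls"] / c["stall_cycles"]       # % FLP Stalls
flp_ineff_pd   = c["flp_ops"] * global_stalls              # PD variant
flp_ineff_pi   = (c["flp_ops"] / c["retired_inst"]) * global_stalls  # PI

print(global_stalls, pct_flp_stalls, flp_ineff_pd, flp_ineff_pi)
```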

Performance Model II

Core logic stalls = L1D_register_stalls + branch_misprediction + instruction_miss + stack_engine_stalls + floating_point_stalls + pipeline_inter_register_dependency + processor_frontend_flush

Memory stalls = L1_hits * L1_latency + L2_hits * L2_latency + L3_hits * L3_latency + local_mem_access * local_mem_latency + remote_mem_access * remote_mem_latency + TLB_miss * TLB_miss_penalty
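The memory-stall term, written as code; the per-level latencies are illustrative placeholders for the machine-specific values that would be used in practice.

```python
# Memory stall model: hits at each level of the hierarchy, weighted by
# that level's access latency. Latencies (cycles) are illustrative only.
LAT = {"L1": 1, "L2": 6, "L3": 14,
       "local_mem": 200, "remote_mem": 350, "tlb_miss": 25}

def memory_stalls(hits):
    """hits: per-event counts, e.g. PAPI-measured cache/memory hits."""
    return sum(hits[level] * LAT[level] for level in LAT)

print(memory_stalls({"L1": 1.0e9, "L2": 2.0e8, "L3": 5.0e7,
                     "local_mem": 1.0e7, "remote_mem": 2.0e6,
                     "tlb_miss": 3.0e6}))
```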


Power Model


Based on on-die components
Leverages performance hardware counters (see the sketch below)
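In the spirit of the counter-based models in the references (e.g., Isci and Martonosi 2003), per-component power is estimated as an activity-weighted maximum plus a static floor. The component list, weights, and wattages below are illustrative placeholders, not the fitted model from the talk.

```python
# Counter-driven power model sketch (after Isci & Martonosi):
#   P = P_static + sum_i access_rate_i * weight_i * P_max_i
# All coefficients below are illustrative placeholders.
P_STATIC = 20.0                      # idle/static power floor (W)
COMPONENTS = {                       # name: (max power W, activity weight)
    "fpu": (12.0, 1.0),
    "alu": (8.0, 1.0),
    "l2_cache": (6.0, 0.8),
    "front_side_bus": (5.0, 0.9),
}

def estimate_power(access_rates):
    """access_rates: per-component accesses per cycle from HW counters."""
    p = P_STATIC
    for name, (p_max, weight) in COMPONENTS.items():
        p += access_rates.get(name, 0.0) * weight * p_max
    return p

print(estimate_power({"fpu": 0.4, "alu": 0.7,
                      "l2_cache": 0.1, "front_side_bus": 0.2}))
```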

Performance Measurement and Analysis System

Components:
― TAU: performance measurement
  http://www.cs.uoregon.edu/research/tau/home.php
― Performance database component(s)
― PerfExplorer: performance and power analysis
  http://www.cs.uoregon.edu/research/tau/docs/perfexplorer/


[Diagram: a component application is instrumented by the TAU component; measurements flow into the database components and on to the PerfExplorer component, whose analysis results feed runtime optimization, compiler feedback, and user/tool analysis.]

PerfExplorer Component

Loads a Python analysis script (see the sketch below)
Performance and power analysis
Data mining, inference rules, comparison of different experimental runs

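For illustration, a minimal script of the kind the component loads. It follows the Jython "glue" scripting interface documented at the PerfExplorer URL above, but the database, trial, and counter names are hypothetical, and the class names should be verified against the installed release.

```python
# Minimal PerfExplorer analysis script sketch (Jython, run inside
# PerfExplorer). Class names follow the PerfExplorer 2 "glue" scripting
# interface; verify against the installed release. All database, trial,
# and counter names below are hypothetical.
from glue import Utilities, TrialMeanResult, DeriveMetricOperation

Utilities.setSession("perfdmf")               # connect to the PerfDMF DB
trial = Utilities.getTrial("GenIDLEST", "opt-levels", "O2")
result = TrialMeanResult(trial)

# Derive Global Stalls = stall_cycles / total_cycles from raw counters.
op = DeriveMetricOperation(result, "BACK_END_BUBBLE_ALL", "CPU_CYCLES",
                           DeriveMetricOperation.DIVIDE)
derived = op.processData().get(0)
print "derived metrics:", derived.getMetrics()
```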

Study I: Performance-Power Trade-offs


Experiment: effect of compiler optimization levels on performance and power

Experimental details:
― Machine: SGI Altix 300 (Linux/ccNUMA)
― MPI processes: 16
― Compiler: OpenUH
― Code: GenIDLEST
― Optimization levels: -O0, -O1, -O2, -O3
― Performance tools: TAU, PerfExplorer, and PAPI


Results


Aggressive optimizations lead to higher power; IPC correlates with power dissipation

Aggressive optimizations lead to lower energy; operation count correlates with energy consumption

Performance/Power Study With PETSc Codes

PETSc: Portable, Extensible Toolkit for Scientific Computation
― http://www.mcs.anl.gov/petsc/

Experimental details:
― Machine: SGI Altix 3600
― Compiler: GCC
― MPI processes: 32
― Application: 2-D simulation of cavity flow
  Krylov subspace linear solvers: FGMRES, GMRES, BiCGS (see the sketch below)
  Preconditioner: block Jacobi
  Problem size: 16x16 per processor (weak scaling)
― Performance tools: TAU, PerfExplorer, PAPI
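These solver and preconditioner choices are runtime options in PETSc (-ksp_type fgmres|gmres|bcgs, -pc_type bjacobi). A minimal sketch using the petsc4py Python bindings, purely for illustration; the tiny diagonal system below stands in for the cavity-flow Jacobian.

```python
# Selecting the solvers compared in this study, via petsc4py.
# Illustration only: a trivial diagonal system stands in for the
# cavity-flow Jacobian. Note that BiCGS is spelled "bcgs" in PETSc.
import sys
import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

for ksp_type in ("fgmres", "gmres", "bcgs"):
    A = PETSc.Mat().createAIJ([16, 16])    # 16x16 sparse matrix
    A.setUp()
    rstart, rend = A.getOwnershipRange()
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)              # A = 2*I, trivial SPD system
    A.assemble()
    b = A.createVecLeft(); b.set(1.0)
    x = A.createVecRight()

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType(ksp_type)                  # -ksp_type fgmres|gmres|bcgs
    ksp.getPC().setType("bjacobi")         # -pc_type bjacobi
    ksp.solve(b, x)
    print(ksp_type, "iterations:", ksp.getIterationNumber())
```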


Inefficiency


― Bottlenecks in the methods used to solve the linear system
― A bottleneck also in the preconditioner

Results

FGMRES has good performance initially
― but is not very power efficient

BCGS is optimal for both performance and power efficiency


Conclusions

There is little or no hardware or software support for detailed power measurement and analysis on modern systems

More integrated toolsets are needed that support both performance and power measurement, analysis, and optimization

Combining such tools with component-based software engineering can improve the efficiency and effectiveness of the tuning process


Future Directions

Integration of the components into a framework
Dynamic selection of algorithms and parameters based on offline/online analyses
Compiler-based performance/power cost modeling
Continued performance and power analysis of PETSc-based codes
Extension of the performance and power models to more modern architectures


References

Jarp, S. A Methodology for Using the Itanium-2 Performance Counters for Bottleneck Analysis. Technical report, HP Labs, August 2002.

Bircher, W. L. and John, L. K. Complete System Power Estimation: A Trickle-Down Approach Based on Performance Events. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 158-168, 2007.

Isci, C. and Martonosi, M. Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), December 2003.

Huck, K., Hernandez, O., Bui, V., Chandrasekaran, S., Chapman, B., Malony, A. D., McInnes, L. C., and Norris, B. Capturing Performance Knowledge for Automated Analysis. In Proceedings of Supercomputing (SC'08), 2008. http://www2.cs.uh.edu/~vtbui/sc.pdf


Acknowledgments

Professors/advisors: Boyana Norris, Lois Curfman McInnes, Barbara Chapman, Allen Malony, Danesh Tafti

Students: Oscar Hernandez, Kevin Huck, Sunita Chandrasekaran, Li Li

SiCortex: Lawrence Stuart and Dan Jackson
MCS Division, Argonne National Laboratory
NSF, DOE, NCSA, NASA
