
2. Motivation

The Challenge of Programming Parallel Architectures

The shift to multi- and many-core architectures made it more complex to develop hardware-optimized applications. A number of performance analysis tools exist to support the application tuning process, but none of them provide recommendations about how to tune the code.

[Figure: the hardware hierarchy of a modern parallel system — a compute cluster of compute nodes, each holding NUMA nodes (CPUs with cores, control units, ALUs, and L1/L2/L3 caches, plus RAM), GPGPUs (streaming multiprocessors with cores, super function units, control units, and L1/L2 caches, plus RAM), I/O disks, non-volatile storage, and an energy source.]

Energy Consumption: In HPC architectures, power costs almost reach the purchase price over a lifetime. Careful application-specific tuning can help to reduce energy consumption without sacrificing an application's performance.
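To make the energy angle concrete, here is a minimal sketch (not from the poster) of how a tuner might measure the energy cost of one code region on Linux, assuming the Intel RAPL powercap counter is exposed and readable; the sysfs path and the dummy workload are assumptions about the target machine.

    #include <fstream>
    #include <iostream>

    // Read the package energy counter in microjoules. Returns 0 if the
    // counter is missing or not readable (recent kernels restrict it to root).
    static long long read_energy_uj() {
        std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
        long long uj = 0;
        f >> uj;
        return uj;
    }

    int main() {
        const long long before = read_energy_uj();

        // Region of interest: one tuning variant of the kernel would run here.
        volatile double sink = 0.0;
        for (long i = 0; i < 100000000L; ++i) sink = sink + i * 0.5;

        const long long after = read_energy_uj();
        // Note: the counter wraps around; a real tool must handle overflow.
        std::cout << "region energy: " << (after - before) / 1e6 << " J\n";
    }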

Interprocess Communication: Performance is significantly influenced by the amount and the speed of communication required. Reducing the communication volume and exploiting the physical network topology can lead to great performance boosts.
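As an illustration (not from the poster), the sketch below shows one common way to cut message counts in an MPI code: replacing many small point-to-point exchanges with a single collective. The buffer size is arbitrary.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<double> local(1024, rank);  // this rank's contribution
        std::vector<double> sum(1024, 0.0);

        // Instead of one MPI_Send/MPI_Recv pair per partner, a single
        // MPI_Allreduce moves the same information in far fewer messages
        // and lets the library exploit the physical network topology.
        MPI_Allreduce(local.data(), sum.data(), 1024, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        MPI_Finalize();
    }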

Load Balancing: Implicit/explicit process synchronization and uneven distribution of work may leave a process idle, waiting for others to finish. The computing power within all parallel processes must be exploited to the fullest; otherwise program scalability may be limited.
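A minimal OpenMP sketch (illustrative, not from the poster): when iteration costs vary, a static partition leaves some threads idle, while dynamic scheduling hands out chunks as threads free up. The workload shape is invented.

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 100000;
        double total = 0.0;

        // Work per iteration is uneven, so equal-sized static chunks would
        // finish at very different times; dynamic chunks rebalance at runtime.
        #pragma omp parallel for schedule(dynamic, 64) reduction(+ : total)
        for (int i = 0; i < n; ++i) {
            double x = 0.0;
            for (int k = 0; k < i % 1000; ++k) x += std::sin(k * 0.001);
            total += x;
        }
        std::printf("total = %f\n", total);
        return 0;
    }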

Data Locality: Frequent accesses to shared or distant data create a considerable overhead. Careful management of resources and ensuring data locality can yield significant performance improvements.
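For example (a sketch under assumed OpenMP usage, not taken from the poster), giving each thread a private accumulator keeps hot data local and avoids contended updates to a shared variable:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<int> data(n, 1);
        long long total = 0;

        #pragma omp parallel
        {
            long long local = 0;               // thread-private: stays in-cache
            #pragma omp for nowait
            for (int i = 0; i < n; ++i) local += data[i];

            #pragma omp atomic                 // one shared update per thread
            total += local;
        }
        std::printf("total = %lld\n", total);
    }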

Memory Access: Even the best arithmetically optimized codes can stall a processor core due to latency in memory access. Careful optimization of memory access patterns can make the most of CPU cycles.
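The classic example is loop ordering over a row-major array; the sketch below (illustrative, not from the poster) shows the same reduction with a cache-hostile and a cache-friendly access pattern.

    #include <vector>

    int main() {
        const int n = 2048;
        std::vector<double> a(static_cast<size_t>(n) * n, 1.0);
        double sum = 0.0;

        // Slow variant (commented out): column walk with stride n, so almost
        // every access touches a new cache line.
        // for (int j = 0; j < n; ++j)
        //     for (int i = 0; i < n; ++i) sum += a[i * n + j];

        // Fast variant: row walk with stride 1, using each cache line fully.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) sum += a[static_cast<size_t>(i) * n + j];

        return sum == static_cast<double>(n) * n ? 0 : 1;
    }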

Single Core Performance: To achieve good overall performance, each core's compute capabilities need to be optimally exploited. By providing access to the implementation details of a targeted platform, application optimizations can be specialized accordingly.
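As a small illustration (not from the poster), a dependence-free loop with contiguous accesses lets the compiler use the core's SIMD units; the omp simd hint is optional and assumes OpenMP 4 support.

    #include <vector>

    // SAXPY: independent iterations and unit-stride accesses vectorize well.
    static void saxpy(float a, const float* x, float* y, int n) {
        #pragma omp simd   // hint that iterations carry no dependences
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        std::vector<float> x(1024, 1.0f), y(1024, 2.0f);
        saxpy(3.0f, x.data(), y.data(), 1024);
        return y[0] == 5.0f ? 0 : 1;
    }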


4. Periscope: the basis for AutoTune

AutoTune will develop the Periscope Tuning Framework (PTF), extending Periscope. It will follow Periscope's main principles, i.e. the use of formalized expert knowledge in the form of properties and strategies, automatic execution, online search based on program phases, and distributed processing. Periscope will be extended by a number of online and semi-online tuning plugins responsible for searching for a tuned code version.

[Figure: PTF Tuning Control Flow — Static Analysis and Instrumentation (source code preprocessing) → Start of Analysis and Tuning via the Periscope Front-End → Selection of Optimization → Hypothesis Selection → Performance Experiment → Performance Analysis → Transformation and/or Parameter (Set) Selection → Optional Application Restart → Verification Experiment(s) → Generate Tuning Report (remaining properties and tuning actions). The tuning strategy comprises a plugin strategy and an analysis strategy.]
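The control flow above boils down to a search loop. The skeleton below is a hypothetical rendering of that loop, not PTF's actual API: Variant, runExperiment, and the measured "runtime" are invented for illustration.

    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Variant { int unrollFactor; int tileSize; };  // a candidate tuning action

    // Stand-in for executing one program phase with the variant applied and
    // measuring it; a real framework instruments and reruns the region online.
    static double runExperiment(const Variant& v) {
        return 1.0 / (v.unrollFactor * v.tileSize);      // dummy "runtime"
    }

    int main() {
        const std::vector<Variant> hypotheses = {{1, 16}, {2, 32}, {4, 64}};
        Variant best{1, 16};
        double bestTime = std::numeric_limits<double>::max();

        for (const Variant& v : hypotheses) {              // hypothesis selection
            const double t = runExperiment(v);             // performance experiment
            if (t < bestTime) { bestTime = t; best = v; }  // performance analysis
        }
        // Verification experiments and the tuning report would follow here.
        std::printf("best: unroll=%d tile=%d\n", best.unrollFactor, best.tileSize);
    }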

Renato Miceli¹, Gilles Civario¹, François Bodin²
¹ Irish Centre for High-End Computing, Trinity Technology & Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
² CAPS Entreprise, Immeuble CAP Nord, Bât A, 4 Allée Marie Berhaut, 35000 Rennes, France
[email protected], [email protected], [email protected]

1. Abstract

Performance analysis and tuning is an important step in programming multicore and manycore architectures. There are several tools to help developers analyze application performance; still, no tool provides recommendations about how to tune the code. AutoTune will extend Periscope, an automatic online and distributed performance analysis tool developed by Technische Universität München, with plugins for performance and energy efficiency tuning. The resulting Periscope Tuning Framework will be able to tune serial and parallel codes with and without GPU kernels; in addition, it will return tuning recommendations that can be integrated into the production version of the code. The whole tuning process, consisting of both automatic performance analysis and automatic tuning, will be executed online, i.e. during a single run of the application.

Funding by European Union FP7, Project no. 288038
Start date: October 15th, 2011
Duration: 36 months (2011-2014)
Total cost: € 3.1 million
Contact: Michael Gerndt <[email protected]>


3. Project Goals

The AutoTune Project's goal is to close the gap in the application tuning process and simplify the development of efficient parallel programs. It focuses on automatic tuning for multicore- and manycore-based parallel systems, ranging from desktop systems with and without GPGPUs to petascale and future exascale HPC architectures. To achieve this objective, AutoTune aims at developing the Periscope Tuning Framework (PTF), the first framework to combine and automate both analysis and tuning into a single tool.

AutoTune's PTF will…
• Identify tuning alternatives based on codified expert knowledge.
• Evaluate the alternatives online (i.e. within the same application execution), reducing the overall search time for a tuned version.
• Produce a report on how to improve the code, which can be applied manually or automatically.

5. AutoTune's Tuning Framework

a. Which tuning plugins?
• GPU programming with HMPP and OpenCL
• Single-core performance tuning
• MPI tuning
• Energy efficiency tuning

b. How will they be implemented?
• Master Agent: responsible for implementing the overall tuning strategy
• Analysis Agents: may implement portions of the tuning plugins
• MRI Monitor: measures the energy consumed and monitors the GPU infrastructure; may implement region-specific tuning actions, e.g. changing the clock frequency for a specific program region (see the sketch after this list)
• C/C++ and Fortran instrumenter: extended for parallel pattern support and for HMPP and OpenCL codes
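As a hedged illustration of the region-specific tuning action mentioned above (not PTF code), the sketch assumes a Linux system where the cpufreq sysfs files are writable (usually root-only); paths and frequency values are assumptions.

    #include <fstream>
    #include <string>

    // Cap the maximum frequency (in kHz) of one core; the write is silently
    // ignored if the file is not writable on this system.
    static void set_max_khz(int cpu, long khz) {
        std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/cpufreq/scaling_max_freq");
        f << khz;
    }

    int main() {
        set_max_khz(0, 1200000);  // 1.2 GHz: cheap for a memory-bound region

        // ... memory-bound program region runs here; a lower clock costs
        // little time but saves energy ...

        set_max_khz(0, 3400000);  // restore the (assumed) previous limit
    }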

c. Using which techniques?
• Expert knowledge
• Iterative search
• Machine learning
• Model-guided empirical optimization

6. Goal Validation

Achievement of the goals will be measured and evaluated at the project's end:
1. Applications that can benefit from the tuning techniques will be selected and manually tuned during the course of the project.
2. At the end of the project, PTF will be run over the same applications.

The improvements achieved and the effort required for both the manual and the automatic tuning will be compared. It is expected that:
• PTF will obtain at least 50% of the manual improvements, or even surpass them (>100%);
• PTF will require only a single or a few application runs, compared to effort measured in months for manual tuning.

7. AutoTune Project Work Plan

[Figure: the AutoTune project iterates over an Optimize → Test → Measure → Analyze loop, repeated over four development cycles (Cycle 1-4).]

Month 12:
• Specification of the tuning model
• Results of the manual tuning of selected applications, demonstrating the potential of the tuning techniques
• Detailed technical specification for all work packages
• Extended version of Periscope's monitor for the tuning plugins
• PTF Demonstrator, demonstrating the integration of PA and tuning plugins

Month 24:
• Prototype versions of the tuning plugins
• Single plugin tuning strategies
• PA strategies for HMPP/OpenCL and energy efficiency
• PTF Integrated Prototype, demonstrating single plugin tuning for the selected applications

Month 32:
• Final tuning plugins and combined plugin tuning strategies
• PTF Release

Month 36:
• Documentation of PTF
• Detailed evaluation
• Demonstration of the promised automatic improvements for the selected applications

Common performance analyzers only hint at where to tune. AutoTune's PTF will also tune the code for you!

PTF will take a program written in MPI/OpenMP, with or without kernels for GPGPUs in HMPP or OpenCL, and will automatically tune it with respect to performance and energy usage. PTF will generate a tuning report such that a developer can integrate the tuning recommendations for production runs.

[Logos: Project Consortium · Associate Partner · Funding]

Poster contacts:
Renato Miceli: [email protected]
Gilles Civario: [email protected]