Processor Performance Counter Monitoring

Dr. Roman Dementiev

roman.dementiev@intel.com

Senior Application Engineer

Software and Services Group

14 July 2010

Legal Disclaimer

Intel may make changes to specifications and product descriptions at any time, without notice.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Lead-free: 45nm product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). Some EU RoHS exemptions for lead may apply to other components used in the product package.

Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900 PPM bromine and 900 PPM chlorine.

Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

© 2009 Standard Performance Evaluation Corporation (SPEC) logo is reprinted with permission

Agenda

• CPU Utilization Monitoring

• Performance Monitoring Units (PMU) in Processors

• Offline analysis with PMU: Intel® VTune™ Performance Analyzer

• Online Dynamic Processor Monitoring (NEW!)

Operating System CPU Utilization Meter

• The best-known meter; exists on almost every OS

– Shows how long the OS spent in the idle/sleep loop

– Worked well with the CPUs of the 1980s

• But OS CPU meters ignore:

– memory access stalls

– synchronisation/locking

– CPU I/O

– Simultaneous multithreading (SMT) – Intel® Hyper-Threading

– etc

• How do I find out what keeps the processor busy? Or is my software just wasting compute cycles?

Existing OS CPU meters cannot predict the capacity of modern hardware
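
To see why such a meter misses stalls, it helps to look at how the number is typically derived: time is accounted as idle or non-idle, and the non-idle share is reported. The sketch below is an illustration only; the `CpuTimes` struct and the `os_utilization` function are hypothetical names, not an OS API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of OS time accounting for one CPU, in scheduler ticks.
struct CpuTimes {
    std::uint64_t idle;   // ticks spent in the idle/sleep loop
    std::uint64_t total;  // all ticks elapsed
};

// Utilization as an OS meter reports it: the non-idle share of the interval.
// Ticks spent stalled on memory or spinning on a lock are counted as busy,
// so this value cannot tell useful work apart from waiting.
double os_utilization(const CpuTimes& before, const CpuTimes& after) {
    assert(after.total >= before.total && after.idle >= before.idle);
    const std::uint64_t total = after.total - before.total;
    if (total == 0) return 0.0;  // empty interval: nothing to report
    const std::uint64_t idle = after.idle - before.idle;
    return 1.0 - static_cast<double>(idle) / static_cast<double>(total);
}
```

A core that spends most of its non-idle ticks waiting for memory still reports close to 100% here, which is the gap the hardware counters on the following slides are meant to fill.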

CPU Utilization Meter in Hardware?

• Modern CPU systems are very complex and consist of many units/resources that influence computation speed

[Figure: diagram with the labels SYSTEM, SOCKET (CPU), and CORE]

Performance Monitoring Units (PMUs)

• Intel® processors have Performance Monitoring Units (PMUs) that can be programmed to count many performance-related events

– One PMU per logical core (number of elapsed cycles, L1 and L2 cache events, TLB events, processed instructions; there are hundreds of events)

– One PMU in the uncore (L3 cache, memory controller, Intel® QPI events)

Programming PMUs

• PMUs are programmed by reading and writing Model Specific Registers (MSRs)

• Much of the hardware and many of the events are platform specific

• The core PMU is enumerated in CPUID leaf 0AH:

– Number of fully programmable counters (4 per logical core); each counter is assigned to count a certain event

– Number of fixed-function counters (3 per logical core): core clocks counter, reference clocks counter, instruction counter

• Some uncore and core programmable counters can be programmed only with certain types of events

• Other tricky restrictions apply; the restrictions are documented in the event list
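
The enumeration in CPUID leaf 0AH can be decoded with a few shifts and masks. The following is a minimal sketch: the struct and function names are hypothetical, and the register values are passed in as plain integers so that the decoding itself stays portable. The bit positions follow the architectural performance monitoring leaf as described in the Software Developer's Manual; the EDX fields are only meaningful when the version ID is greater than 1.

```cpp
#include <cassert>
#include <cstdint>

// Fields of CPUID leaf 0AH (architectural performance monitoring).
struct CorePmuInfo {
    unsigned version;            // EAX[7:0]   version ID
    unsigned programmable;       // EAX[15:8]  programmable counters per logical processor
    unsigned programmable_bits;  // EAX[23:16] bit width of each programmable counter
    unsigned fixed;              // EDX[4:0]   fixed-function counters
    unsigned fixed_bits;         // EDX[12:5]  bit width of each fixed-function counter
};

// Decode the EAX and EDX values returned for leaf 0AH (hypothetical helper).
CorePmuInfo decode_cpuid_leaf_0a(std::uint32_t eax, std::uint32_t edx) {
    CorePmuInfo info{};
    info.version = eax & 0xFFu;
    info.programmable = (eax >> 8) & 0xFFu;
    info.programmable_bits = (eax >> 16) & 0xFFu;
    info.fixed = edx & 0x1Fu;
    info.fixed_bits = (edx >> 5) & 0xFFu;
    return info;
}
```

For the illustrative inputs `eax = 0x07300403` and `edx = 0x00000603`, the decoder yields 4 programmable and 3 fixed-function counters, matching the counts on this slide. On GCC or Clang for x86 the register values can be obtained with `__get_cpuid` from `<cpuid.h>`; other compilers provide their own intrinsics.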

Processor Performance Counters

Publicly documented on intel.com:

• David Levinthal, "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors", http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2, http://www.intel.com/products/processor/manuals/

• Intel® Xeon® Processor 7500 Series Uncore Programming Guide, http://www.intel.com/Assets/en_US/PDF/designguide/323535.pdf

• Peggy Irelan and Shihjong Kuo, "Performance Monitoring Unit Sharing Guide", http://software.intel.com/file/20476

Intel® Hyper-Threading Technology-specific:

• Drysdale, Gillespie, Valles, "Performance Insights to Intel® Hyper-Threading Technology", http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/

• Gillespie, Drysdale, "Intel® Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload", http://software.intel.com/en-us/articles/intel-hyper-threading-technology-analysis-of-the-ht-effects-on-a-server-transactional-workload/

PMU Sampling Mode: The Statistical Method of Finding Hotspots

• A sampling collector (like VTune™ Performance Analyzer or Intel® Performance Tuning Utility)

– PMU periodically interrupts the processor

• Triggered by the occurrence of a certain number of events

– Collects the execution context

• Execution address in memory (CS:IP)

• Operating system process and thread ID

• Executable module loaded at that address

– If you have symbols for the module, post-processing can identify the function or method at the memory address.

– Line numbers from the symbol file can direct you to the relevant line of source code.
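
The post-processing step can be sketched as follows. This is a simplified illustration, not how any particular tool implements it: the symbol table is a hypothetical map from function start address to name, and function end addresses, modules, and process/thread IDs are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table: function start address -> function name.
using SymbolTable = std::map<std::uint64_t, std::string>;

// Attribute every sampled instruction pointer to the function with the
// greatest start address not exceeding it, and count samples per function.
std::map<std::string, std::size_t> hotspots(const SymbolTable& symbols,
                                            const std::vector<std::uint64_t>& samples) {
    std::map<std::string, std::size_t> counts;
    for (std::uint64_t ip : samples) {
        auto it = symbols.upper_bound(ip);  // first function that starts after ip
        if (it == symbols.begin()) {        // ip lies below every known function
            ++counts["<unknown>"];
            continue;
        }
        ++counts[std::prev(it)->second];
    }
    return counts;
}
```

Because each sample is triggered after a fixed number of events, a function's share of the samples approximates its share of those events, which is what makes the method statistical.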

Introducing Intel® VTune™ Performance Analyzer

• Helps identify and characterize performance issues by:

– Collecting performance data from the system running your application.

– Organizing and displaying the data in a variety of interactive views, from system-wide down to source code or processor instruction perspective.

– Identifying potential performance issues and suggesting improvements.

– Providing application profiling information

– Providing a tuning assistant and a great help system

• Besides sampling analysis with the PMU, it can also produce call graphs (not covered here)

Just a few things you can do with processor performance events

• Check if your software is NUMA-optimized (local/remote memory accesses)

• Cache-local or not

• Memory bandwidth bound or not

• Branchy or not (branch mispredictions)

• Has "bad" long-latency instructions on the critical path

• Has performance bugs in multithreaded programs (false sharing, …)

• Exploits instruction parallelism well or not

• See also the article "Using Intel® VTune™ Performance Analyzer to Optimize Software for the Intel® Core™ i7 Processor Family", http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/

DEMO

• Intel® VTune™ Performance Analyzer in action!

Offline Analysis: VTune™ Performance Analyzer Sampling Collector

Select Event: clock ticks, L2/L3 cache misses, branch mispredictions, etc.

Offline Analysis: Intel® VTune™ Performance Analyzer Sampling Collector

Hotspot view of one module for all OS processes and threads, grouped by function (or method).

Sampling Source View Displays Source Code Annotated with Performance Data

PMU Counting Mode

• No interrupts generated

• The application periodically reads the number of events that have occurred from the PMU counters

• Very small overhead

• Advanced online use cases become possible (see the next slides)
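
In counting mode a metric is always the difference between two reads. The sketch below shows that arithmetic under stated assumptions: the counter is free-running, wraps to zero on overflow, and wraps at most once between the two reads; the counter width is a parameter because it depends on the processor. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Events elapsed between two raw reads of a free-running hardware counter
// that is `width_bits` wide and wraps to zero on overflow.
std::uint64_t counter_delta(std::uint64_t before, std::uint64_t after,
                            unsigned width_bits) {
    assert(width_bits >= 1 && width_bits <= 64);
    const std::uint64_t mask = (width_bits == 64)
                                   ? ~std::uint64_t{0}
                                   : ((std::uint64_t{1} << width_bits) - 1);
    // Unsigned subtraction is modular, so a single wrap-around is handled.
    return (after - before) & mask;
}

// Instructions per cycle from two counter deltas; 0 if no cycles elapsed.
double instructions_per_cycle(std::uint64_t instructions, std::uint64_t cycles) {
    return cycles == 0 ? 0.0
                       : static_cast<double>(instructions) / static_cast<double>(cycles);
}
```

Reading two counters and subtracting costs only a handful of instructions, which is why the overhead stays small compared with interrupt-driven sampling.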

Online Performance Counter Monitoring: Access Intel® CPU Counters* in Your Program

Terminology:

• A system consists of several sockets (= CPUs)

• A socket has a number of (logical) cores

Usage pattern

1. Save counter state for {core,socket,system} into a state object 1

2. Run user code or experiment

3. Save counter state for {core,socket,system} into a state object 2

4. Using state objects 1 and 2, compute performance/utilization metrics

Caution: OS may schedule different user threads on the same core (context switches)

NEW! Access not only core counters (clock ticks, L2 cache misses, etc.) but also uncore (Intel® memory controllers, Intel® QPI, etc.) counters*

* Implemented for Intel® Core™ i7, Xeon® 5500, 5600 and 7500 Processor Series (based on microarchitecture codenamed Nehalem/Westmere)

Example C++ code

Monitor * m = Monitor::getInstance();

if (m->good()) m->program(); // program counters

SystemCounterState before_sstate, after_sstate;

before_sstate = getSystemCounterState();

[run your code here]

after_sstate = getSystemCounterState();

cout << "IPC: " << getIPC(before_sstate, after_sstate)
     << " L3 cache hit ratio: " << getL3CacheHitRatio(before_sstate, after_sstate)
     << " Bytes read: " << getBytesReadFromMC(before_sstate, after_sstate)
     << [and so on]…

Example 1

• Compare traversal/searching in the STL list vs. STL vector (4 byte records)

• C++ code to measure:

std::find(ds.begin(), ds.end(), ds.size());

Get CPU performance insights in real time
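
A self-contained version of this experiment might look like the sketch below. It is an assumption-laden illustration: the container is filled with the values 0 to n-1 so that searching for `ds.size()` never succeeds and forces a full traversal, and wall-clock time from `std::chrono` stands in for the PMU metrics, since the counter API is not reproduced here. The `ScanResult` struct and `scan_for_absent_value` function are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Outcome of one traversal: did the search run to the end, and how long did it take.
struct ScanResult {
    bool reached_end;
    std::chrono::nanoseconds elapsed;
};

// Fill a container with 0..n-1 (4-byte records) and search for the value n,
// which is absent, so std::find has to visit every element.
template <typename Container>
ScanResult scan_for_absent_value(std::size_t n) {
    Container ds(n);
    std::iota(ds.begin(), ds.end(), std::uint32_t{0});
    const auto start = std::chrono::steady_clock::now();
    const auto it = std::find(ds.begin(), ds.end(), ds.size());
    const auto stop = std::chrono::steady_clock::now();
    return {it == ds.end(),
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling `scan_for_absent_value<std::list<std::uint32_t>>(n)` and `scan_for_absent_value<std::vector<std::uint32_t>>(n)` with the same `n` gives the two measurements to compare. Wrapping the `std::find` call between two counter-state snapshots, as in the usage pattern earlier in the deck, would replace the timing with cache and IPC metrics.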

Intel® Performance Counter Monitor* (Linux*/Windows*)

Easily collect CPU performance data

* The name might change in the future

Linux* KDE* plug-in

Visualize CPU performance in real time

Advanced Examples

• Software reads data from PMUs in an online fashion

NEW! Self-tuning software!

Example 2: "CPU resource"-aware scheduling

• Problem (a simplified one):

– schedule 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs on a single core

– jobs are equal in size

– unknown background activity exists

• Goal: minimize total completion time

CPU Monitoring Unaware Scheduler

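[Figure: timeline of the monitoring-unaware schedule. Rows: memory-bandwidth-intensive background activity, compute-intensive jobs, memory-bandwidth-intensive jobs. Horizontal axis: time. Numeric labels in the figure: 11, 11.]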

CPU Monitoring Aware Scheduler

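[Figure: timeline of the monitoring-aware schedule. Rows: memory-bandwidth-intensive background activity, compute-intensive jobs, memory-bandwidth-intensive jobs. Horizontal axis: time. Numeric labels in the figure: 12, 13.]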

In an experiment with 2000 jobs we measured 16% faster completion time*

* Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
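
The decision at the heart of a monitoring-aware scheduler can be sketched in a few lines. This is a hypothetical illustration, not the scheduler used in the experiment: it assumes that the fraction of memory bandwidth in use over the last interval has already been derived from uncore counter readings, and the 0.5 threshold is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstddef>

enum class JobKind { Compute, Memory, None };

// Jobs still waiting in each of the two queues (hypothetical bookkeeping).
struct Queues {
    std::size_t compute_left;
    std::size_t memory_left;
};

// Pick the next job so that it complements the measured load: when memory
// bandwidth is already busy (for example because of background activity),
// prefer a compute-intensive job; otherwise prefer a memory-intensive one.
JobKind pick_next(const Queues& q, double bandwidth_used_fraction,
                  double busy_threshold = 0.5) {
    assert(bandwidth_used_fraction >= 0.0);
    if (q.compute_left == 0 && q.memory_left == 0) return JobKind::None;
    const bool memory_busy = bandwidth_used_fraction >= busy_threshold;
    if (memory_busy) {
        return q.compute_left > 0 ? JobKind::Compute : JobKind::Memory;
    }
    return q.memory_left > 0 ? JobKind::Memory : JobKind::Compute;
}
```

An unaware scheduler would make the same choice regardless of the measured fraction, which is how memory-intensive jobs end up competing with memory-intensive background activity.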

Advanced Use-Cases I

• Extend the problem (to be closer to reality):

– Schedule to all Hyper-Threaded cores in the system

– The remaining capacities are not known a priori because the exact resource utilization of the jobs is not predictable

• Do we have room to put another job on this HT core?

– Should it be a compute-intensive or rather a memory-intensive job?

• CPU performance monitoring can provide more insight and help answer these questions

Advanced Use-Cases II

• Depending on the remaining resource capacities, choose the best algorithm to compute the result

– memory-intensive or

– compute-intensive

• Choose between implementations

– single-threaded or

– multithreaded (all cores) or

– with limited threading

– and so on…
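
Choosing between single-threaded, limited, and full threading could be driven by measured per-core load, as in the sketch below. It is a hypothetical example: the per-core busy fractions are assumed to come from counter readings over the last interval, and the 0.25 cut-off for treating a core as available is an arbitrary placeholder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Given the busy fraction of each core (0.0 = idle, 1.0 = fully busy),
// return how many threads to start: one per core with headroom, and at
// least one so that the work always makes progress.
unsigned choose_thread_count(const std::vector<double>& core_busy_fraction,
                             double idle_below = 0.25) {
    const std::ptrdiff_t idle_cores =
        std::count_if(core_busy_fraction.begin(), core_busy_fraction.end(),
                      [idle_below](double busy) { return busy < idle_below; });
    return static_cast<unsigned>(std::max<std::ptrdiff_t>(idle_cores, 1));
}
```

A result of 1 corresponds to the single-threaded implementation, a result equal to the number of cores to the fully multithreaded one, and anything in between to limited threading.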

Conclusions and Takeaways

• Current OS CPU utilization meters are not suited for modern hardware

• Modern processor PMUs provide metrics to get deep insight into processor performance and resource utilization

• Processor performance counters are heavily used in established performance tools like Intel® VTune™ Performance Analyzer

• New advanced use cases for PMUs in dynamic online optimization become possible: a new kind of intelligent, CPU-monitoring-aware software
