
Intel® Performance Bottleneck Analyzer User’s Guide


Contents

Quick Setup Instructions for Intel® Performance Bottleneck Analyzer Package

Prerequisites

Installing Intel® Performance Bottleneck Analyzer

Collecting TB5 files and Analyzing Data with Intel® PBA

Introduction to the Framework

Intel® Performance Bottleneck Analyzer Granularity Definitions

Intel® Performance Bottleneck Analyzer Capabilities

Intel® Performance Bottleneck Analyzer Report GUI

Latest Architecture Support

LBR Support

TB5 Database Backend

Linux and Mac Parsing Support on Windows

Load Latency

Multiplexing Support

Slow Frames Analysis

Power Correlation Capability

Templates

Top Down Counter Analysis Using “Observations”

Known Issues

Papers/blogs on Intel® PBA Support for Latest Architectures

Meet the Intel® Performance Bottleneck Analyzer Design Team

Acknowledgements

Bugs


Quick Setup Instructions for Intel® Performance Bottleneck Analyzer Package

Prerequisites:

- Java runtime (JRE) version 6 Update 10 or later: http://www.java.com/en/download/manual.jsp

- Intel® VTune™ Amplifier XE, needed for parsing Sampling Collector data: http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/ If it is not installed, the analysis stage fails with a ‘valid license not found’ error.

- Sampling Collector, for data collection. It is included with the Intel® Performance Tuning Utility (Intel® PTU) v4 update 5 or later package: http://software.intel.com/en-us/articles/intel-performance-tuning-utility/

Note: To install the Sampling Collector from Intel® PTU, open a command window in Administrator mode and run “sepreg.exe -i” from the <ptu_home>\bin\ directory. A system reboot is required the first time this is run on a system; subsequent installs and updates do not require a reboot. See 'SEP_Install_Instructions' in the <PBA_home>\docs folder for more detailed installation instructions.

Note: A multiplexed dataset must include the programmable clockticks event (CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.CORE_P). The scripts shipped with PBA already collect it.

Installing Intel® Performance Bottleneck Analyzer

1. Install Intel® PBA by extracting the downloaded zip file on your system. Do not extract it to a directory whose path contains spaces; spaces in the PBA installation path cause crashes during data collection.

2. Find the xIF.lnk shortcut in the directory where the zip file was extracted and open it to launch the Intel® PBA command window.

Note: Do not open this shortcut with administrative privileges, as this changes your base directory and prevents the environment from being set correctly.

3. Set the Sampling Collector run-time environment before starting to collect sampling data:

a. Navigate to where the Sampling Collector is installed in Intel® PTU, typically <ptu_home>\bin

b. Run ‘setup_sep_runtime_env.cmd’ to set the Sampling Collector environment

4. Navigate to the desired working directory. Then, for standard Intel® PBA analysis, see the ‘Intel® PBA text-based assistance utility’ section below to get started.

Note: If Intel® VTune™ Amplifier XE cannot be installed on the system for any reason, copy the valid license file purchased for the product to the license path so that Intel® PBA can run successfully (e.g. copy the license file to C:\Program Files\Common Files\Intel\Licenses; on 64-bit systems use the ‘Program Files (x86)’ folder).

Collecting TB5 files and Analyzing Data with Intel® PBA

Collecting and analyzing data can be done in three ways:

- Intel® PBA text-based assistance utility

o Contains simple prompts for new users to learn the flow of collection and analysis

- Intel® PTU graphical interface

o Intel® PBA is integrated with other tools for ease of collection and analysis

- Collections directly using the Sampling Collector

o For more advanced users, Intel® PBA offers options to collect directly using the Sampling Collector

NOTE: While collecting data on Intel® Core™ i7 and Intel® Core™ i7 Extreme architectures, you may see the error “Invalid Event Discarded: event_name”. Continue with data collection; valid data is still collected. The error is caused by changes in event names between processors.

Intel® PBA text-based assistance utility

The text-based assistance utility, ‘xiflauncher’, helps new users collect, analyze and view data. It is meant to teach the flow of analysis; experienced users will most likely want to switch to collecting and analyzing manually for more flexibility.

1. Collect data using xiflauncher

a. Run the ‘xiflauncher’ command at the command prompt

b. Set up the Sampling Collector paths as mentioned above using setup_sep_runtime_env.cmd

c. You will see options to collect data, analyze data, and view data

d. Select the ‘Collect Data’ option

e. Follow the instructions on the screen to collect the data

f. We recommend collecting Last Branch Record (LBR) and multiplexing data for the architecture under analysis

2. Analyze data using xiflauncher

a. Select the ‘Analyze Data’ option

b. Follow the instructions on the screen to analyze the data. Because xiflauncher only handles the strict case of a run on Windows*, it assumes that all tb5 files are in the tb5 directory of the working directory.

c. Intel® PBA assumes that you know the module of interest. If you do not, run the analysis on the application exe and check the ‘module’ tab when the results are opened to find the most time-consuming module.

3. View data using xiflauncher

a. Once the analysis is complete, you can close the analysis command prompt opened by the previous step

b. Select the ‘View Data’ option

c. This opens the Intel® PBA GUI at the default Module selection screen. See the ‘Intel® Performance Bottleneck Analyzer Report GUI’ section for more on how to navigate through the data.

Collecting and analyzing data using Intel® PTU

Users can use Intel® PTU to collect and analyze Intel® PBA samples.

Collect Intel® PBA data:

To collect Intel® PBA samples, open Eclipse from the <ptu_home>\eclipse directory. Navigate to the preferences via Window -> Preferences. Add the Intel® PBA report location to the path and click OK, as shown in Figure 1 below.

Figure 1 Intel(R) PTU Preference Window

Create a new project by selecting ‘New Project’ in the File menu. In the new project wizard, select New Intel® PTU project.

Figure 2 Intel(R) New Project Wizard

Enter the project name and click Next. In the workload details window, enter the profile information to be collected and click Finish. This creates a new project in your tuning navigator window.

Right-click on your project -> Profile As and select Intel® PBA Client Profiling to collect event samples.

Figure 3 Intel(R) PBA Profile Selection

Analyzing data using Intel® PTU-PBA tool

After the collection completes, right-click in the output window and select Intel® PBA Report. This runs Intel® PBA and opens the GUI with the event analysis.

Figure 4 Intel(R) PBA GUI Report

Collecting Sampling Data for Intel® PBA Directly Using the Sampling Collector

This section takes you through the process of collecting the profile data. It gives you enough information to begin analyzing the sampling data using the Intel® PBA framework on recent Intel® architectures.

There are two modes of data collection for a full Intel® PBA analysis:

- Multiplexing Mode (Recommended): Collects all required events in a couple of runs.

- Single Collection Mode (Not Recommended): Collects one set of events per run over numerous runs. Intel® PBA must have at least the clockticks and instructions retired events for the platform to run basic analysis.

Multiplexing Mode

We use multiplexing as the default mode to collect the necessary data. The following steps set up your Sampling Collector for a multiplexed collection.

First, set up your Sampling Collector environment for collection. Use the multiplexing scripts we have provided to ensure that the correct Sampling Collector data is collected. The Intel® PBA Framework maintains multiplexing scripts for the Intel® Core™ i7 microarchitecture, Intel® 2nd Generation Core™ processors and Intel® Atom™ processors. Script files are placed under the <PBA>\Scripts\Multiplexing directory.

- NHM_runmp1.cmd: Collects critical and high priority events required on Intel® Core™ i7 processors.

- WSM_runmp1.cmd: Collects critical and high priority events required on Intel® Core™ i7 Extreme processors.

- SNB_runmp1.cmd: Collects critical and high priority events required on Intel® 2nd Generation Core™ processors.

- BNL_runmp.cmd: Collects critical and high priority events required on Intel® Atom™ processors.

- NHM_runlbr.cmd/WSM_runlbr.cmd/SNB_runlbr.cmd/BNL_runlbr.cmd: Collects Last Branch Record (LBR) data.

- NHM_loadlatency.cmd/WSM_loadlatency.cmd/BNL_loadlatency.cmd: Collects load latency data.

- SNB_runPDIR.cmd: Collects precise instructions retired for a more accurate basic block hit count.

For more details on Last Branch Record and Load Latency, see the ‘Intel® Performance Bottleneck Analyzer Capabilities’ section.

Each script takes two arguments: the duration of the collection and the tb5 file name.

All the multiplexing scripts start the Sampling Collector in non-pause mode, which means the collection starts immediately.

Note: You can add the -sp option in the multiplexing script to start the collection in pause mode.

After you have finished setting up your system with the correct multiplexing script and database file, run the multiplexing script from a command prompt in administrator mode for 1-4 minutes, depending on the CPU utilization. For example, if an application workload has 100% CPU usage, you can run the multiplexing script for 80-120 seconds. If an application workload has 25%-50% CPU usage, you must run the multiplexing script for more than 180 seconds for accurate profiling.

Example: NHM_runmp1.cmd 60 abc.tb5

Here 60 is the duration in seconds and abc.tb5 is the output file name.
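The script names and duration guidance above can be captured in a small helper. The following Python sketch is illustrative only and is not part of Intel® PBA: the function names are hypothetical, and the concrete durations it returns (100 s and 240 s) are assumptions picked from within the ranges quoted above. The guide gives no figure for workloads between 50% and 100% CPU usage or below 25%, so the sketch simply uses the longer duration for anything under 100%.

```python
# Hypothetical helper; the script names are the ones listed in this guide.
MULTIPLEX_SCRIPTS = {
    "NHM": "NHM_runmp1.cmd",  # Intel Core i7
    "WSM": "WSM_runmp1.cmd",  # Intel Core i7 Extreme
    "SNB": "SNB_runmp1.cmd",  # Intel 2nd Generation Core
    "BNL": "BNL_runmp.cmd",   # Intel Atom
}


def suggest_duration_seconds(cpu_utilization_percent):
    """Pick a collection time inside the 1-4 minute window from the guide."""
    if not 0 < cpu_utilization_percent <= 100:
        raise ValueError("utilization must be in (0, 100]")
    # Assumption: 100 s lies inside the 80-120 s range for a fully loaded CPU;
    # 240 s satisfies the "more than 180 s" rule for lighter workloads.
    return 100 if cpu_utilization_percent >= 100 else 240


def multiplex_command(arch, cpu_utilization_percent, tb5_name):
    """Return '<script> <seconds> <tb5 file>' as one command-line string."""
    script = MULTIPLEX_SCRIPTS[arch.upper()]
    seconds = suggest_duration_seconds(cpu_utilization_percent)
    return f"{script} {seconds} {tb5_name}"


print(multiplex_command("NHM", 100, "abc.tb5"))
```

The resulting string still has to be run from an administrator command prompt with the Sampling Collector environment set, as described above.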

Single Collection Mode:

In single collection mode, we collect the basic events: clockticks and instructions retired. To collect basic events, run sep -start in the <ptu>/bin directory and later analyze the data using xiflauncher.

Figure 5: steps to analyze data

Analyze data using ‘scripts’ in Intel® PBA

Intel® PBA provides sample scripts which can be used to run the most typical types of analyses at various granularities. The scripts can be found in the PBA_ROOT\scripts folder and are detailed below.

Full Analysis

Script Name: full_analysis.cmd

This runs the full issue analysis along with streams, spikes, and loops, in multiplexing and non-multiplexing mode. This is the most common type of analysis run in Intel® PBA.

Usage

full_analysis.cmd <module_name> <tb5_dir_name> <SNB\MRM\PNR\NHM\BNL>

Example: full_analysis.cmd abc.exe myTb5Dir SNB

This command generates a “db” directory containing a database of output, which the Intel® PBA GUI reads to display the results.

To open .xif files, copy the run_gui.cmd file from <PBA>\scripts\ and run it.

Example: run_gui.cmd

Go to “File->Open” and select the results.xif file. Refer to the “Intel® Performance Bottleneck Analyzer Report GUI” section for additional information on how to use the GUI to navigate through the data.

Compare Architectures with Full Analysis

Script Name: compare_full_analysis.cmd

This runs a comparison analysis between two architectures, including issue analysis along with streams, loops, and spikes.

Usage

compare_full_analysis.cmd <module_name> <tb5_dir_name_arch1> <tb5_dir_name_arch2> <SNB\MRM\PNR\NHM\BNL> <SANDYBRIDGE\MEROM\PENRYN\CORE™ I7\BONNELL>

Example: compare_full_analysis.cmd abc.exe myTb5Dir myTb5Dir2 SNB CORE™ I7

The first architecture parameter (and first tb5 directory name) is for the primary architecture under evaluation. All of the issues identified are for the primary architecture. The second architecture parameter is for the competitive architecture. This is useful for comparing against data from the same binary running on a different architecture or configuration (e.g. Intel® Core™ i7 vs. Intel® 2nd Generation Core™, SMT off vs. SMT on, different cache sizes, etc.).

This analysis creates the same files as described above, with the secondary architecture information populated for comparison against the primary architecture. Use the Viewer GUI to open .xif files as described above, and see the “Intel® Performance Bottleneck Analyzer Report GUI” sub-section for additional information on how to view the results.

Additional Options of Interest

Some additional options can be added to the command line to enable different features of Intel® PBA. Note that this is advanced usage of Intel® PBA, so support for issues encountered is limited. Below is an example of how to add an option to your analysis.

Usage example from full_analysis.cmd

Add an option (here -csvoutput, at the end) to the command line.

java -Xmx1300m xIFJava.Main -f "%~1" -jdir "%~2" -%3 -csvoutput

Here are a few options that may be useful to add or modify.

1. -csvoutput

a. This option outputs various csv files of data similar to what is displayed in the GUI. An example is fuction_overview.csv, which contains function hotspots and counter data. The preferred usage is to view data in the GUI, but the csv files also contain some data that is currently unavailable in the GUI. One example is the multilineissues.csv file, which contains the full assembly for issues that span multiple Instruction Pointers, such as store forward blocked.

2. -Xmx<number of MB for Java heap>m

a. This Java™ option is already set to 1300 in the sample scripts, but may need to be increased for large datasets when additional memory is available.
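As a rough illustration of how the three script arguments and the optional flags map onto the java command line shown above, the following Python sketch assembles the same argument list. The function name `analysis_command` and its parameters are hypothetical; only the flags that appear in this guide (-Xmx, -f, -jdir, the architecture code, -csvoutput) are used.

```python
def analysis_command(module, tb5_dir, arch, heap_mb=1300, csv_output=False):
    """Return the argument list for the xIFJava analysis command."""
    known = {"SNB", "MRM", "PNR", "NHM", "BNL"}  # codes from the usage line
    if arch not in known:
        raise ValueError(f"unknown architecture code: {arch}")
    cmd = ["java", f"-Xmx{heap_mb}m", "xIFJava.Main",
           "-f", module, "-jdir", tb5_dir, f"-{arch}"]
    if csv_output:
        cmd.append("-csvoutput")  # also write the csv files described above
    return cmd


# Same shape as the example line, with a larger heap for a big dataset.
print(" ".join(analysis_command("abc.exe", "myTb5Dir", "SNB",
                                heap_mb=2048, csv_output=True)))
```

Returning a list rather than a single string keeps module names and directories intact if the command is later passed to a process launcher.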

Introduction to the Framework

Intel® Performance Bottleneck Analyzer (aka xIF = x86/64 Issue Finder) was written to provide a framework which can be used to identify and prioritize issues specific to current and future Intel® architectures. The framework accomplishes this goal through the following capabilities:

Static assembly analysis

- An executable, shared library, or object file is disassembled and analyzed for known issues on Intel® architecture

- Event-based profiling data must be present so that the issues found can be prioritized using clockticks and hit counts

- Assembly latencies are provided for the following architectures: Intel® 2nd Generation Core™, Intel® Atom™, Intel® Core™ i7, Intel® Core™ 2 Duo

Correlation of event based sampling and assembly

- Uses event-based sampling to reconstruct the instruction stream of execution as it ran on the CPU in nanosecond time

- Correlates issues found with events, using knowledge of the surrounding assembly at any granularity

- Embeds any events collected alongside the execution profile

Comparison between architectures

- Architectures covered: Intel® 2nd Generation Core™, Intel® Atom™, Intel® Core™ i7, Intel® Core™ 2 Duo

- View data at granularities of module, function, instruction stream, and basic block

- View the execution profile of various architectures with events and assembly latencies embedded

Below is a diagram of the basic flow of analysis.

Figure 6 - Intel® PBA Execution Flow

Intel® Performance Bottleneck Analyzer Granularity Definitions

Basic Block

Intel® Performance Bottleneck Analyzer determines where the basic-block boundaries lie based on where branches and entry points exist within the code, as well as on data from the LBR (last branch record). The formal definition of a basic block is any assembly code that runs continuously without branching. All instructions within a basic block have the same execution frequency. Analysis at the basic block level negates some of the effects of skid and allows us to look at reproducible data on an application. The tool also looks at all calls and indirect branches to identify entry points. Each entry point is treated as the start of a new basic block as well.
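The boundary rule can be sketched in a few lines of Python. This is a minimal sketch assuming the textbook rule that a block ends after every branch and a new block starts at every branch target or entry point; the function name `split_basic_blocks` and the data layout are hypothetical, and how Intel® PBA implements this internally (including its use of LBR data) is not described in this guide.

```python
def split_basic_blocks(addresses, branch_addresses, targets):
    """Split instruction addresses (in program order) into basic blocks.

    addresses        -- instruction addresses in program order
    branch_addresses -- addresses of branch/call/return instructions
    targets          -- branch targets and other known entry points
    """
    blocks, current = [], []
    for addr in addresses:
        # A branch target or entry point starts a new block.
        if addr in targets and current:
            blocks.append(current)
            current = []
        current.append(addr)
        # A branch ends the block it belongs to.
        if addr in branch_addresses:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Toy listing: a branch at 0x14 and an entry point at 0x08.
blocks = split_basic_blocks(
    [0x00, 0x04, 0x08, 0x0C, 0x10, 0x14, 0x18],
    branch_addresses={0x14},
    targets={0x08},
)
print([[hex(a) for a in block] for block in blocks])
```

In the toy listing the entry point at 0x08 splits the first five instructions into two blocks, and the branch at 0x14 closes the second one.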

Basic Block Example

lea edx,ds:[300A2CB8h] //Clock% 0.01

neg ax //Clock% 0.4

mov bl,al //Clock% 0.2

and ebx,0Fh //Clock% 0.9

je 30C83658 //Clock% 1.3

Total_Block_Clock = 2.81% total process clockticks
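The block total is simply the sum of the per-instruction clock percentages. A short Python check of the arithmetic, using decimal values to avoid floating-point rounding noise (the dictionary name is illustrative):

```python
from decimal import Decimal

# Clock% values copied from the basic block example.
clock_percent = {"lea": "0.01", "neg": "0.4", "mov": "0.2",
                 "and": "0.9", "je": "1.3"}

total = sum(Decimal(value) for value in clock_percent.values())
print(f"Total_Block_Clock = {total}% total process clockticks")
```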

Spike Explanation

A spike object is any point in the profile where the tool sees a push-out in the retirement of instructions, signifying a bottleneck in the pipeline. We use a simple heuristic to determine where spikes occur in our streams and mark them so that we can attempt to explain the bottleneck with events and static assembly analysis. The spike object undergoes more analysis than other portions of the application. Spikes usually appear on the instruction after the one where the bottleneck occurs, because sampling typically attributes the cost of a major bottleneck to IP+1 (IP = Instruction Pointer). The example below shows three spikes identified in a stream.
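The guide does not spell out the heuristic, so the following Python sketch is only a stand-in to make the idea concrete. It assumes a spike is any instruction whose clock share is at least a chosen multiple of the stream median, and it shifts each spike back one instruction to reflect the IP+1 attribution. The threshold factor, function names and sample data are all hypothetical.

```python
from statistics import median


def find_spikes(clock_percent, factor=4.0):
    """Return indices whose clock share is >= factor times the median."""
    baseline = median(clock_percent)
    return [i for i, share in enumerate(clock_percent)
            if baseline > 0 and share >= factor * baseline]


def likely_culprits(spike_indices):
    """Shift each spike back one instruction to undo IP+1 attribution."""
    return [max(i - 1, 0) for i in spike_indices]


# Per-instruction Clock% for a made-up stream of eight instructions.
stream = [0.1, 0.2, 0.1, 2.5, 0.1, 0.2, 3.0, 0.1]
spikes = find_spikes(stream)
print("spikes at", spikes, "likely caused at", likely_culprits(spikes))
```

A median baseline is used here because a few very hot instructions would inflate a mean and hide smaller spikes.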


Stream and Loops Explanation

The streams granularity represents common paths of execution in the application. Streams are created by using LBR (last branch record) data and basic block hit counts to recreate the most common paths of execution on the chip. In the absence of LBR data, the tool still creates the streams granularity but has to make some risky assumptions about the path of execution. A loop is just a specialized stream that starts at the head of the loop and ends at the tail of the loop. The streams granularity is fundamental to studies because it accomplishes three goals:

1) Gives the performance engineer the context of the performance issue

2) Allows the engine to analyze a much larger window of generated code

3) Helps account for skid

Example of a flow of execution of a stream:

30aa668a: E8 68 call 30aa63f7 //Jump to address 30aa63f7

30aa63f7: 8B 0F mov ecx,dword ptr [edi]


30aa63ff: C1 FB 18 sar ebx,18h

30aa6403: 8B F0 mov esi,eax

30aa6405: 0F 88 CD js 30AA64D8 //Fall thru to next IP

30aa640b: 0F B7 47 04 movzx eax,word ptr [edi+4]

30aa640f: 66 3D FE FF cmp ax,0FFFEh

30aa6413: 74 66 je 30AA647B //Jump to address 30aa647b

30aa647b: 0F B7 movzx eax,word ptr [edi+4]
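One way to picture how branch records lead to such a path is to follow, from each block, the outgoing edge that was observed most often. The Python sketch below does exactly that; it is a simplified illustration, not the tool's algorithm. The function name is hypothetical, and the edge counts are invented to resemble the example above (the js mostly falling through, the je mostly taken).

```python
def most_common_path(edge_counts, start, max_len=64):
    """Follow the hottest outgoing edge from each block, starting at start."""
    path, seen = [start], {start}
    while len(path) < max_len:
        successors = {dst: count for (src, dst), count in edge_counts.items()
                      if src == path[-1]}
        if not successors:
            break
        nxt = max(successors, key=successors.get)
        if nxt in seen:  # returning to a visited block closes a loop
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# (from_block, to_block) -> observed count; counts are made up.
edges = {
    ("30aa668a", "30aa63f7"): 100,  # call
    ("30aa63f7", "30aa640b"): 90,   # js falls through
    ("30aa63f7", "30aa64d8"): 10,   # js taken
    ("30aa640b", "30aa647b"): 80,   # je taken
    ("30aa640b", "30aa6415"): 10,   # je falls through
}
print(most_common_path(edges, "30aa668a"))
```

Without LBR-style edge counts the successor choice would have to be guessed from block hit counts alone, which is where the risky assumptions mentioned above come in.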

HyperBlock Explanation

A HyperBlock is a set of instructions presented in retirement order, similar to a stream. No other path of execution significantly jumps out of or into the middle of a HyperBlock. All instructions within a HyperBlock have the same execution frequency. The stream granularity explained above is essentially a chain of HyperBlocks in the most typical order in which they were executed. The HyperBlock granularity was created for the following reasons:

1) Code frequency statistics are more accurate, since a typical HyperBlock is much larger than a basic block.

2) Branch statistics on every basic block were too much information; the HyperBlock granularity helps make them readable.

3) Feeding paths of execution that run at the same frequency was necessary to interoperate with assembly analyzers such as Intel® Architecture Code Analyzer (IACA).

Intel® Performance Bottleneck Analyzer Capabilities

Intel® PBA is designed, written and maintained by performance engineers who are actively engaged with software vendors. Every feature written for Intel® PBA has been used to study and identify performance opportunities for our ISV customers. The framework was designed to use knowledge of processor events and static assembly analysis to automatically explain performance bottlenecks. The bottlenecks which cannot be explained are prioritized and tagged for further analysis. The team has been analyzing our own applications with the Intel® PBA framework on Intel® 2nd Generation Core™ and Intel® Atom™ for about a year. The toolset is now aware of approximately 100 events on the Intel® 2nd Generation Core™ architecture and can associate those events with patterns it finds with its static assembly analysis.

Intel® PBA has the following new components:

Intel® Performance Bottleneck Analyzer Report GUI

The GUI was designed by Seung-Woo and Erik. Intel® PBA has a GUI for looking at the analysis at many different granularities and linking them to issues and other insights that the tool provides.

Intel® PBA outputs a set of directories containing the database for the analysis. The root directory for the output is always the “db” directory in the location where xIFJava was run. Intel® PBA analyzes one module at a time, but you can analyze several modules in multiple runs; all the analyses are merged and stored in the “db” directory.

GUI Overview

0. File->Open: Opens the results of the analysis.

a. The result of an analysis is saved to results.xif and the “db” directory in the same folder.

b. When multiple analyses are performed on multiple modules, the results are merged in the “db” directory.

c. Opening results.xif loads all the merged results of the analyses.

1. Options: Sets the default behavior of the GUI.

a. Percent View: toggles percent vs. event count in the GUI.

b. Stream / Loop View: shows either streams or loops.

c. Domain (x) axis: changes the x axis in the chart.

2. Chart: Shows the data in chart form. The chart currently shown is determined by “Selection Granularity” in (5).

3. Events: Shows the list of events to display.

a. TB5: raw Sampling Collector events.

b. Costs In Domain: Ratios defined in the template.

c. Observation: High level stats.

4. Module Summary: Various stats for the selected module. The Chart and Selection Granularity table show the data from the stream / loop perspective. The tables here show collections of the different objects and granularities, grouped by kind.

a. Spikes: shows the list of all spikes for the selected module. Hovering the cursor over “Explained%”, “Load BreakDown”, or “LFB Breakdown” shows further information in a tooltip.

b. Right-clicking on any row in the tables pops up the streams / loops containing the selected object.

c. Other entities show their own collections. The information varies depending on the objects and the events selected.

5. Selection Granularity: Selects the granularity of the objects to analyze. The granularities are arranged in hierarchical order: Module contains Stream / Loop, Stream / Loop contains HyperBlock, HyperBlock contains Basic Block, etc. This also controls what is shown in the chart area in (2).

a. Module: Shows the list of modules and their associated data for the events selected in the events window in (3).

i. In the above figure, only clockticks is selected for the Intel® 2nd Generation Core™ architecture in (3).

ii. xxx.exe is consuming 99.16% of all clockticks.

iii. Selecting xxx.exe in the chart automatically selects it in the module table, and vice versa.

iv. Double-clicking the bar in the chart (xxx.exe) drills down to the stream selection view for the module. Alternatively, select “xxx.exe” and click on the “Stream” tab in the selection granularity window.

b. Stream: Shows the list of streams and their associated data for the events selected in the events window in (3).

i. Selection on the chart and table works the same way as for modules.

ii. Double-clicking a stream in the chart, however, switches the view to the instruction line level.

iii. The HyperBlock, BasicBlock, and InstructionLine granularities show the same graph in the chart, but the selection granularity in the chart differs. For example, there are 6 lines in the corresponding HyperBlock.

c. HyperBlock / Basic Block / Instruction Line: Displays the sequence of instructions along with the associated data for the events selected in the events window (3).

i. In the upper right corner of the chart, the overall issues found in the currently selected stream are displayed, along with the impact of the individual issues on the stream.

ii. The big circles in the chart are the spikes. Small circles around a spike show specific types of information for that spike. Clicking on the big circle cycles through the information on the label. You can drag the label around for better viewing.

21

Latest Architecture Support The tool supports latest and greatest Intel® architectures such as Intel® 2nd Generation Core™

and Intel® Atom™. We have been working directly with architecture teams to put in a lot of

support for architecture specific bottlenecks. See Intel® Atom™ and Intel® 2nd Generation

Core™ studies section below for more details.

LBR Support

Charlie Hewett has tackled capturing full LBR (last branch record) data. This functionality results in more accurate reproductions of the path of execution, accurate hit counts on basic blocks, and the capability to output branch statistics. It also helps us create our new base granularity, named HyperBlocks. In the near future we will use it to produce a statistical call graph. LBR support is only available on the Intel® Core™ i7, Intel® 2nd Generation Core™ and Intel® Atom™ architectures. Precise instructions retired (PDIR) is also collected to assist with basic block hit counts on the Intel® 2nd Generation Core™ architecture.

TB5 Database Backend

Rajshree, Joe and Erik worked directly with the Intel® VTune™ Amplifier XE team to integrate their backend database, creating a data access layer that feeds counter data to the tool engine.

Linux and Mac Parsing Support on Windows

Joe has spent a lot of time ensuring that Linux and Mac analysis works. Sampling Collector data collected on either a Linux* or Mac* OS X system can be moved to a system running Intel® Performance Bottleneck Analyzer and analyzed as usual. Simply copy the TB5 data set, along with the necessary binaries (with or without symbols stripped), and follow the steps outlined in this document.

For OS X*, Intel® Performance Bottleneck Analyzer only recognizes the Mach-O binary file, not the .app package. The binary for such a package can often be found inside it under MyApp.app/Contents/MacOS. In addition, the Sampling Collector on OS X is still under development and may not be immediately available at the time of this Intel® Performance Bottleneck Analyzer release.

The current release of Intel® Performance Bottleneck Analyzer does not support running on Linux* or OS X*. Please contact the Intel® PBA development team for more information and updates.

Load Latency

The load latency event is now used to break down the dreaded LFB (line fill buffer) source, which can incur any latency. This feature, implemented by Peter and Rajshree, has also helped us determine whether Intel® AVX loads are missing in the L1D or not. Load latency is only available on Intel® Core™ i7, Intel® Core™ i7 Extreme and Intel® 2nd Generation Core™ architectures.

Load latency data provides additional information on load cost in two cases where the precise load events are ambiguous:

- There is a known issue with the precise events and Intel® Advanced Vector Extensions (Intel® AVX) 256-bit loads. For these loads, the precise events data looks as if loads are always satisfied from the L1D or the Line Fill Buffer (LFB). The tool can use load latency data to substitute a more accurate breakdown of probable sources.

- For all loads from the LFB, the actual cost can be highly variable. The tool can use load latency data to provide a supplemental breakdown of the likely sources of data loaded from the LFB, giving a better picture of the overall cost of a load with a large share of LFB samples.

Note: Because percentages are truncated to one decimal place in the GUI, the load breakdown may show 0.0% coming from the LFB while an LFB breakdown is still shown when the load latency counter is collected.
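The effect described in the note is easy to reproduce. In the Python sketch below, a small but non-zero LFB share is truncated to one decimal place for display, so it shows as 0.0% even though LFB samples exist. The function name and the 0.04% figure are hypothetical.

```python
import math


def truncate_percent(value, decimals=1):
    """Truncate (not round) a non-negative percentage to `decimals` places."""
    scale = 10 ** decimals
    return math.floor(value * scale) / scale


lfb_share = 0.04  # hypothetical: 0.04% of load samples came from the LFB
print(f"displayed: {truncate_percent(lfb_share):.1f}%")
print("LFB samples present:", lfb_share > 0)
```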

Multiplexing Support

The tool is moving toward accomplishing a full analysis from a single run. The command lines and shell scripts for performing a multiplexed run on each architecture are included in our scripts folder. Multiplexing support has been thoroughly tested on non-steady-state workloads, thanks to some great engineering work by Manuj Sabharwal. Several post-processing checks have been implemented to ensure that the multiplexing data is representative of the entire run. The only drawback today is that LBR data, load latency data and Precise Instructions Retired (PDIR) cannot be collected with multiplexing and need to be collected in separate runs. The LBR and load latency runs are not required for analysis, but we recommend running them.

Slow Frames Analysis

Slow framerate profiling was created to compare data from slow frames with data from fast frames and determine everything that differs between the two data sets.

This is accomplished by instrumenting the application binary with calls that record the time stamp counter (tsc) when each frame starts and when it finishes. The delta is then checked against user input and recorded if it is greater than the time allowed per frame at the specified minimum framerate.

The start and end times of the slow frames are written to a csv file, which the tool reads in order to perform the analysis between the two data sets.
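The frame check can be sketched as follows. This Python example assumes a constant time stamp counter frequency and converts the minimum framerate into a per-frame tick budget; frames that exceed the budget are written to a csv file. The function names, the column names and the csv layout are hypothetical, since the file format the tool expects is not documented here.

```python
import csv
import io


def slow_frames(frames, tsc_hz, min_fps):
    """Return (start, end) TSC pairs whose duration exceeds the frame budget."""
    budget_ticks = tsc_hz / min_fps  # ticks allowed per frame at min_fps
    return [(start, end) for start, end in frames
            if end - start > budget_ticks]


def write_slow_frames_csv(frames, out):
    """Write one row per slow frame; the column names are illustrative."""
    writer = csv.writer(out)
    writer.writerow(["start_tsc", "end_tsc"])
    writer.writerows(frames)


# Three frames on a hypothetical 3 GHz TSC with a 30 fps minimum.
recorded = [(0, 90_000_000), (90_000_000, 250_000_000),
            (250_000_000, 340_000_000)]
slow = slow_frames(recorded, tsc_hz=3_000_000_000, min_fps=30)

buffer = io.StringIO()
write_slow_frames_csv(slow, buffer)
print(buffer.getvalue())
```

At 3 GHz and 30 fps the budget is 100,000,000 ticks, so only the middle frame (160,000,000 ticks) is recorded as slow.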

A follow-up blog will be published explaining the usage of this feature in detail.

Power Correlation Capability

Intel® Performance Bottleneck Analyzer provides a way to correlate CPU power data from NetDAQ* analysis (a CSV file created by NetDAQ*) with the performance data collected by Sampling Collector. Power and performance data are collected at the same time for any workload. The correlation is then achieved by using time stamp events from the NetDAQ* data and the Sampling Collector tb5 files. Using the CPU power data from NetDAQ*, the tool creates two bins of data for comparison. The high power bin is set at the upper 20% of the power limit from the CPU power data; the rest becomes the low power bin. Intel® PBA then compares these power bins to identify hot modules and functions that are more active in the high power area than in the low power area.
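As a rough illustration of the binning idea, the Python sketch below splits performance samples into a high and a low power bin by time stamp. It is not the tool's implementation. It assumes one possible reading of "upper 20%", namely readings at or above 80% of the peak observed power, and the sample layouts and the function name `split_by_power` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def split_by_power(power_samples, perf_samples, high_fraction=0.20):
    """Count performance samples per module in the high and low power bins.

    power_samples -- list of (timestamp, watts), sorted by timestamp
    perf_samples  -- list of (timestamp, module_name)
    high_fraction -- share of the peak power treated as "high" (assumption)
    """
    times = [t for t, _ in power_samples]
    watts = [w for _, w in power_samples]
    threshold = max(watts) * (1.0 - high_fraction)

    high, low = Counter(), Counter()
    for t, module in perf_samples:
        i = bisect_right(times, t) - 1  # latest power reading at or before t
        if i < 0:
            continue  # sample precedes the first power reading
        (high if watts[i] >= threshold else low)[module] += 1
    return high, low

power = [(0, 10.0), (10, 30.0), (20, 40.0), (30, 12.0)]
perf = [(1, "a.dll"), (12, "a.dll"), (21, "spin.dll"),
        (25, "spin.dll"), (28, "a.dll"), (33, "a.dll")]
high, low = split_by_power(power, perf)
```

Comparing the two counters per module shows which modules are relatively more active while power is high; in this made-up data, `spin.dll` appears only in the high power bin.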

Here is what we get in module view for an ISV workload:

The data indicates that, at the module level, xxx.dll spends more time in the high power bin.

Drilling down further to the function level indicated that a spin wait loop is more active in the high power bin. Discussion with the ISV indicated that __pause was not used in the spin wait loop.

Thanks to Rajshree, George and Jun for adding this functionality.

Templates

Templates provide a way to customize the issues and ratios that the user wants to see at run-time. Templates also provide a way to override the cost of any architectural issue. If the costs are not overridden in templates, the default costs per architecture are taken from the arch layer. These are populated per architecture in a CSV file stored in the PBA_ROOT\templates folder.

For Intel® 2nd Generation Core™ architecture, the template file includes ratios called ‘OBSERVATIONS’. Observations are a way to determine which stage of the pipeline the code is bottlenecked on. We have 4 high level observations: FrontEnd (with a further breakdown of how many uops are delivered by the FrontEnd per cycle), BackEnd, BadSpeculation (mispredicted uops) and Retiring. We have also included sub-categories inside these top 4 categories to identify the stages of the pipeline that are bottlenecks. Observations don’t have any cost associated with them, since they are not actual issues but indications of issues. The percentages associated with observation factors indicate, at a high level, the primary bottleneck at each object granularity, which can be used to zoom in further on issues.

We are currently working on breaking those higher observations down further to zoom into each part of the pipeline. Currently these are not in any hierarchical order; this will be included in future releases.

How to add an issue to templates

1. Open the required architecture’s template CSV file.

2. Add the issue name, rule type (e.g. dynamic) and issue description.

3. Add the numerator and denominator events that are used to calculate ratios for dynamic issues in the (). For these events, we support event math such as (eventA-eventB+eventC).

4. MulFactor is the static cost for the issue on the selected architecture. E.g. the cost of LLC_MISS on Intel® Core™ i7 is typically ~200 clocks (which is stored in the arch layer by default). But if your calculated cost run shows LLC_MISS as 180 clocks instead, you can change the MulFactor in the templates CSV file, which will automatically override the default cost. This helps customize the issue impact per application.

Currently ratio_high and ratio_low have the same values. These indicate the threshold to check for in your application before tagging an issue to it. E.g. if LLC_MISS_CounterName/Clockticks*200 > 0.05, then we have an LLC_MISS issue.

5. The Non-Supported Objects column provides a way to exclude any object granularity from the specified rule. E.g. if we want to apply a rule only for loop level analysis, we can exclude other object types such as streams, blocks, modules and functions from that rule by adding the object types here. The valid object types are: ModuleData, FunctionData, LoopData, StreamData, SpikeData, BlockData.

6. PriorityFactor indicates the priority of the issue within the hierarchy that we maintain. This is based on how much the issue would typically cost if it is found. We use the priority factor to give appropriate weight to multiple issues found at an object granularity, so that we account for issues like LLC_MISS before putting any weight on front end issues.
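To make steps 3 to 5 concrete, the Python sketch below shows how a dynamic rule of this kind could be evaluated. It is a hedged illustration, not the tool's code and not the template CSV format: the `IssueRule` class, its field names and the dictionary representation of event math are hypothetical, and a single threshold stands in for ratio_high and ratio_low since they currently hold the same value.

```python
from dataclasses import dataclass, field

@dataclass
class IssueRule:
    name: str
    numerator: dict        # event name -> signed coefficient (event math)
    denominator: dict      # same representation as the numerator
    mul_factor: float      # static cost in clocks; overrides the arch default
    ratio_threshold: float
    non_supported_objects: set = field(default_factory=set)

def event_math(terms, counts):
    """Evaluate an expression such as eventA-eventB+eventC."""
    return sum(coeff * counts[event] for event, coeff in terms.items())

def issue_applies(rule, object_type, counts):
    """Return True if the rule tags the given object with the issue."""
    if object_type in rule.non_supported_objects:
        return False  # this granularity is excluded from the rule
    denom = event_math(rule.denominator, counts)
    if denom == 0:
        return False
    impact = event_math(rule.numerator, counts) / denom * rule.mul_factor
    return impact > rule.ratio_threshold

llc_miss = IssueRule(
    name="LLC_MISS",
    numerator={"LLC_MISS_CounterName": 1},
    denominator={"Clockticks": 1},
    mul_factor=200,
    ratio_threshold=0.05,
    non_supported_objects={"ModuleData"},
)
counts = {"LLC_MISS_CounterName": 500, "Clockticks": 1_000_000}
tagged_loop = issue_applies(llc_miss, "LoopData", counts)      # 500/1e6*200 = 0.1
tagged_module = issue_applies(llc_miss, "ModuleData", counts)  # excluded object type
```

Changing `mul_factor` to 180 corresponds to overriding the default cost in the templates CSV file, as described in step 4.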

Top Down Counter Analysis Using “Observations”

We have added initial capability to perform top down counter analysis on Intel® 2nd Generation Core™ by enabling “Observations”. Observations help to analyze CPU execution at a high level and then drill down in a structured manner to identify the true bottleneck(s). Charlie Hewett has been working with Ahmad Yasin to get the ratios defined in our templates file.
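For orientation, the Python sketch below computes the four top-level categories using the top-down formulas publicly documented for Intel® 2nd Generation Core™ in the optimization guide (4 issue slots per cycle). The ratios actually defined in the Intel® PBA templates file may differ; the function name and the dictionary layout are hypothetical.

```python
def top_level_observations(c):
    """Return the four top-level pipeline-slot fractions from raw event counts."""
    slots = 4 * c["CPU_CLK_UNHALTED.THREAD"]  # 4 issue slots per cycle
    frontend = c["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_speculation = (c["UOPS_ISSUED.ANY"]
                       - c["UOPS_RETIRED.RETIRE_SLOTS"]
                       + 4 * c["INT_MISC.RECOVERY_CYCLES"]) / slots
    retiring = c["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    backend = 1.0 - (frontend + bad_speculation + retiring)  # remainder
    return {
        "FrontEnd": frontend,
        "BadSpeculation": bad_speculation,
        "Retiring": retiring,
        "BackEnd": backend,
    }

# Made-up counts for illustration only.
counts = {
    "CPU_CLK_UNHALTED.THREAD": 1000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800,
    "UOPS_ISSUED.ANY": 2200,
    "UOPS_RETIRED.RETIRE_SLOTS": 2000,
    "INT_MISC.RECOVERY_CYCLES": 50,
}
obs = top_level_observations(counts)
```

The four fractions sum to 1, so the largest one points at the part of the pipeline to drill into first.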

Known Issues

1) Intel® Performance Bottleneck Analyzer only supports client workloads and processors. Analysis can be attempted for HPC or server workloads and cores but is not supported.

2) Analysis of managed code (e.g. Java or C#) with Intel® Performance Bottleneck Analyzer is not supported at this time.

3) Analysis of the Linux kernel from Intel® PBA cannot be fully trusted. The toolset has a known issue of dropping samples from the kernel analysis, which will be fixed in the next revision of the tool. The tool will output a warning after analysis indicating that samples have been dropped.

4) If you overwrite an old tb5 file with a newly collected tb5 file of the same name, make sure to delete the temp_data folder, since we currently check only tb5 file names and not time/date stamps.

5) Statically found issues may get a higher cost based on hit count.

6) Load latency data is only available at spike level for the instruction before the pinnacle of the spike.

7) LBR data may have invalid addresses on certain platforms and configurations. We have been debugging what we believe to be a firmware issue on single socket Intel® Core™ i7 Extreme based platforms where the addresses returned are bogus. We have implemented a check in Intel® PBA: if the addresses are invalid, you will see the following message in the console output, and Intel® PBA execution will continue without LBR data:

WARNING: Total number of taken branch (i.e. usable) Lbr

samples was: 0

Execution will continue, but LBR analysis will not be

available.

If you see this issue, you may need to collect data on Intel® Core™ i7, Intel® 2nd Generation Core™ or Intel® Atom™ to get valid LBR data.

8) The GUI only displays the top 20 streams, loops and modules in the bar graph, but the table below it contains the entire list.

9) On Intel® 2nd Generation Core™ observations, events haven’t been fully validated for the SMT case.

10) When running comparison analysis on the full dataset (LBR, load latency) on two architectures, it is strongly recommended to run on a system with 4GB of memory and to increase the Java heap size to 3GB instead of the default 1.3GB by editing the compare_full_analysis.cmd file. A crash is likely to happen with the 1.3GB heap option.

11) For Windows* XP only, please install the redistributables below:

1. Microsoft Visual Studio* 2005 SP1 redistributables

2. Microsoft Visual Studio* 2008 redistributables

12) When resolving or copying the binaries and/or symbols, Intel® PBA Launcher can give the error “Module does not exist”. A possible cause is Windows user access control blocking the access.

13) On certain Intel® Core™ i7 Extreme processors, load breakdown may not give correct information when the data is collected using Intel® PTU v 4 update 5, as one of the events may be missing. It is recommended to use manual collection or collection via the text based utility in this case.

14) When analyzing data with xiflauncher, you may see the error “tb5 directory not found”. In that case, create a tb5 folder inside the working directory and copy the tb5 files into it. Xiflauncher handles only the strict case of a run on Windows* and is just a learning mechanism.

Papers/blogs on Intel® PBA

Support for Latest Architectures

Intel® Performance Bottleneck Analyzer has added additional support for the Intel® Atom™

processor and 2nd Generation Core™ architecture analysis.

See blogs written on short call-ret finder and zero length call finder for Intel® Atom™

architecture at

http://software.intel.com/en-us/blogs/2010/10/25/zero-length-calls-can-tank-atom-processor-performance/

http://software.intel.com/en-us/blogs/2010/10/12/avoid-short-functions-on-atom/

For 2nd Generation Core™ processor support case studies, see optimization guide appendix B

(using performance monitoring events – sub-section 3)

http://www.intel.com/Assets/PDF/manual/248966.pdf

Load breakdown using precise load retired events is described in the blog at

http://origin-software.intel.com/en-us/blogs/2010/09/30/utilizing-performance-monitoring-events-to-find-problematic-loads-due-to-latency-in-the-memory-hierarchy/

Using load latency to estimate line fill buffer breakdown is described at

http://software.intel.com/en-us/blogs/2010/11/11/utilizing-load-latency-event-in-performance-monitoring-to-get-line-fill-buffer-breakdown/

Meet the Intel® Performance Bottleneck Analyzer Design Team

Rajshree Chabukswar

Architect

Templates

Event data

Issue finders

Module/function/thread granularities

Intel® VTune™ Amplifier XE backend integration

Load latency

Power analysis

Mike Chynoweth

Architect

Line/Hyperblock/Block/Stream/Loop/Spike granularities

Issue finders

Issue object layer

Intel® 2nd Generation Core™ support

Intel® Atom™ Support

Jun De Vega

Power analysis

Issue finders

Eli Hernandez

Issue Finders

Charlie Hewett

Command line

LBR infrastructure

Architecture layer

Observations

Seung-Woo Kim

Issue Finders

Intel® PBA Reporting GUI

GUI database

Petter Larsson

Intel® Atom™ support

Issue Finders

George Lin

Issue Finders

Power analysis

Lynn Merrill

Intel® Atom™ support

Issue Finders

Erik Niemeyer

Architect

Architecture layer

Data access layer/Dicer

Issue finders

XED Disassembler layer

Logging layer

Intel® PBA Reporting GUI

Database layer

Source control

Intel® Atom™ Support

Text-based Launcher

Peter Nee

Load latency

Intel® AVX load support

Intel® 2nd Generation Core™ finders

SIMD partial register stall finder

Joe Olivas

Linux*

Mac* OS X

Intel® IACA support

Intel® VTune™ Amplifier XE backend integration

Chris Phlipot

Competitive analysis

Intel® PBA Reporting GUI

Bucketing layer

Intel® PBA production support

Manuj Sabharwal

Multiplexing support

PTU Integration

Scripts

Intel® Core™ i7 support

Vladimir Tsymbal

IP2SYM Symbol resolution

Acknowledgements

Many thanks to the Sampling Collector, Intel® VTune™ Amplifier XE backend integration and Intel® IACA development teams, who helped in resolving issues with integration into Intel® Performance Bottleneck Analyzer.

Sampling Collector – Shobha Ranganathan, Vishnu Naikawadi, Bhanu Shankar

Intel® VTune™ Amplifier XE Backend – Tony Mongkolsmai, Alexei Alexandrov, Anna Malashkina,

Lee Baugh, Douglas Armstrong, Anton Yefimov

Intel® PTU – Julia Fedorova, David Levinthal, Dmitry Bazhin, Alexey Bukhnin, Anastasya

Vladimirova, Iliya Grachev

Intel® Architecture team – Ahmad Yasin

Intel® IACA – Israel Hirsh, Tal Uliel

Bugs

Please submit bugs to the whatif site on PBA.