Intel Performance Bottleneck Analyzer User’s Guide
Contents
Quick Setup Instructions for Intel® Performance Bottleneck Analyzer Package
  Prerequisites:
  Installing Intel® Performance Bottleneck Analyzer
  Collecting TB5 files and Analyzing Data with Intel® PBA
Introduction to the Framework
Intel® Performance Bottleneck Analyzer Granularity Definitions
Intel® Performance Bottleneck Analyzer Capabilities
  Intel® Performance Bottleneck Analyzer Report GUI
  Latest Architecture Support
  LBR Support
  TB5 Database Backend
  Linux and Mac Parsing Support on Windows
  Load Latency
  Multiplexing Support
  Slow Frames Analysis
  Power Correlation Capability
  Templates
  Top Down Counter Analysis Using “Observations”
Known Issues
Papers/blogs on Intel® PBA Support for Latest Architectures
Meet the Intel® Performance Bottleneck Analyzer Design Team
Acknowledgements
Bugs
Quick Setup Instructions for Intel® Performance Bottleneck Analyzer Package
Prerequisites:
- Java run time (JRE) version 6 Update 10 or greater http://www.java.com/en/download/manual.jsp
- Intel® VTune™ Amplifier XE – needed for parsing sampling collector data http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/ Failure to install this will cause a ‘valid license not found’ error during the analysis stage.
- Sampling Collector – for data collection. Included with the Intel® Performance Tuning Utility (Intel® PTU) v 4 update 5 or later package http://software.intel.com/en-us/articles/intel-performance-tuning-utility/
Note: To install the Sampling Collector from Intel® PTU, open a command window in Administrator mode and run “sepreg.exe -i” from the <ptu_home>\bin\ directory.
A system reboot is required the first time this is run on a system. No reboot is needed for subsequent installs/updates.
See 'SEP_Install_Instructions' located in the <PBA_home>\docs folder for more detailed installation instructions.
Note: For a multiplexed dataset, the programmable clockticks event (CPU_CLK_UNHALTED.THREAD_P/ CPU_CLK_UNHALTED.CORE_P) must be collected (it is already included in the scripts for PBA).
Installing Intel® Performance Bottleneck Analyzer
1. Install Intel® PBA by extracting the downloaded zip file on your system. Avoid spaces in the
directory name where PBA is installed, as they cause crashes during data collection.
2. Navigate to find the xIF.lnk shortcut in the directory where the zip file is extracted and open
it to launch the Intel® PBA command window.
Note: Do not open this link with administrative privileges, as this will change your base
directory and prevent the environment from being set correctly.
3. Set the Sampling Collector run time environment before starting to collect sampling data
a. Navigate to where the Sampling Collector is installed in Intel® PTU. Typically it
should be located here: <ptu_home>\bin
b. Run ‘setup_sep_runtime_env.cmd’ to set the Sampling Collector environment
4. Navigate to the desired working directory. Then, for standard Intel® PBA analysis, see the
‘Intel® PBA text-based assistance utility’ section below to get started.
Note: If Intel® VTune™ Amplifier XE cannot be installed on the system for any reason, the valid
license file purchased for the product should be copied into the license path so that Intel® PBA can run
successfully (e.g. copy the license file to C:\Program Files\Common Files\Intel\Licenses. For 64-bit
systems use the ‘Program Files (x86)’ folder)
Collecting TB5 files and Analyzing Data with Intel® PBA
Collecting and analyzing data can be done in 3 ways:
- Intel® PBA text-based assistance utility
  o Contains simple prompts for new users to learn the flow of collection and analysis
- Intel® PTU graphical interface
  o Intel® PBA has integrated with other tools for ease of collection and analysis
- Collections directly using the Sampling Collector
  o For more advanced users, Intel® PBA offers options to collect directly using the Sampling Collector
NOTE: While collecting data on Intel® Core™ i7 and Intel® Core™ i7 Extreme architectures, the user
may see the error “Invalid Event Discarded: event_name”. Continue with data collection, as it will
still collect valid data. This happens due to changes in event names between processors.
Intel® PBA text-based assistance utility
The text-based assistance utility, ‘xiflauncher’, helps new users collect, analyze and view data.
This utility is meant to teach the flow of analysis to new users; experienced users will
most likely want to switch to collecting and analyzing manually for more flexibility.
1. Collect data using xiflauncher
a. Run ‘xiflauncher’ command at the command prompt
b. Setup Sampling Collector paths as mentioned above using
setup_sep_runtime_env.cmd
c. You will see options to collect data/ analyze data/ view data
d. Select option to ‘Collect Data’
e. Follow the instructions on the screen to collect the data
f. We recommend collecting Last Branch Record (LBR) and multiplexing data
for the architecture under analysis
2. Analyze data using xiflauncher
a. Select option to ‘Analyze Data’
b. Follow the instructions on the screen to analyze the data. Because xiflauncher only
handles the strict case of a run on Windows*, it assumes that all tb5 files are in the tb5 directory
of the working directory.
c. Intel® PBA assumes that the user knows the module of interest. If it is not known,
run the analysis on the application exe and check the ‘module’ tab when the results are
opened to confirm the most time-consuming module.
3. View data using xiflauncher
a. Once the Analysis is complete, you can close the analysis command prompt that is
opened by the previous step
b. Select option to ‘View Data’
c. This will open the Intel® PBA GUI to the default Module selection screen. See the
‘Intel® Performance Bottleneck Analyzer Report GUI’ section for more on how to
navigate through the data.
Collecting and analyzing data using Intel® PTU
Users can use Intel® PTU to collect and analyze Intel® PBA samples.
Collect Intel® PBA data:
To collect Intel® PBA samples, open Eclipse from the <ptu_home>\eclipse directory. Navigate to
the preferences in Windows -> Preference. Add the Intel® PBA report location to the path and click OK, as
shown in figure 1 below.
Figure 1 Intel(R) PTU Preference Window
Create a new project by navigating to ‘New Project’ in the File Menu.
In the new project wizard, select New Intel® PTU project.
Figure 2 Intel(R) New Project Wizard
Enter the project name and click Next. In the workload details window, enter the profile information to
be collected and click Finish. This will create a new project in your tuning navigator window.
Right click on your project -> Profile As and select Intel® PBA Client Profiling to collect event
samples.
Figure 3 Intel(R) PBA Profile Selection
Analyzing data using Intel® PTU-PBA tool
After the collection completes, right click in the output window and select Intel® PBA Report. This
will run Intel® PBA and open the GUI with event analysis.
Figure 4 Intel(R) PBA GUI Report
Collecting Sampling Data for Intel® PBA Directly Using the Sampling Collector
This section takes you through the process of collecting the profile data. It gives you enough
information to begin analyzing the sampling data using Intel® PBA framework on recent Intel®
architectures.
There are two modes of full data collection for Intel® PBA full analysis:
- Multiplexing Mode (Recommended): Collects all events required in a couple of runs.
- Single Collection Mode (Not Recommended): Collects one set of events per run over
numerous runs. Intel® PBA must have at least the clockticks and instructions retired
events for the platform to run basic analysis.
Multiplexing Mode
We use multiplexing as the default mode to collect the necessary data. The following steps will
set up your Sampling Collector for multiplexing collection.
First, set up your Sampling Collector environment for collection. Utilize the multiplexing scripts
we have provided to ensure the correct Sampling Collector data is collected. We maintain
support for Intel® Core™ i7 microarchitecture, Intel® 2nd Generation Core™ Processor and
Intel® Atom™ multiplexing scripts in the Intel® PBA Framework. Script files are placed under
<PBA>\Scripts\Multiplexing directory.
- NHM_runmp1.cmd: Collects critical and high priority events required on Intel® Core™
i7 processors.
- WSM_runmp1.cmd: Collects critical and high priority events required on Intel® Core™
i7 Extreme processors.
- SNB_runmp1.cmd: Collects critical and high priority events required on Intel® 2nd
Generation Core™ processors.
- BNL_runmp.cmd: Collects critical and high priority events required on Intel® Atom™
processors
- NHM_runlbr.cmd/WSM_runlbr.cmd/SNB_runlbr.cmd/BNL_runlbr.cmd: Collects Last
Branch Record (LBR) data
- NHM_loadlatency.cmd/WSM_loadlatency.cmd/BNL_loadlatency.cmd: Collects load
latency data
- SNB_runPDIR.cmd: Collects precise instructions retired for a more accurate basic block
hit count.
For more details on Last Branch Record and Load Latency, see ‘Intel® Performance Bottleneck
Analyzer Capabilities’ section.
Each script takes two arguments: the duration of the collection and the tb5 filename.
All the multiplexing scripts start the Sampling Collector in non-pause mode, which means the
collection will start immediately.
Note: You can add the –sp option in the multiplexing script to start the collection in
pause mode.
After you have finished setting up your system with the correct multiplexing script and database file,
run the multiplexing script from a command prompt in administrator mode for 1-4 minutes,
depending on the CPU utilization. Example: If an application workload uses 100% of the CPU,
you can run the multiplexing script for 80-120 seconds. If an application workload uses 25%-50% of
the CPU, you must run the multiplexing script for more than 180 seconds for accurate profiling.
Example – NHM_runmp1.cmd 60 abc.tb5
Where 60 is the duration in seconds and abc.tb5 is the output file name.
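The invocation above can be assembled programmatically when several collections are scripted. The following is a minimal sketch, not part of Intel® PBA: the helper name `mux_command` and its checks are hypothetical, and only the script names and the argument order (duration, then tb5 file name) come from this guide. It builds the command but does not run it.

```python
# Hypothetical helper: maps the architecture prefixes used in this guide to
# the multiplexing scripts listed under <PBA>\Scripts\Multiplexing.
MUX_SCRIPTS = {
    "NHM": "NHM_runmp1.cmd",  # Intel Core i7
    "WSM": "WSM_runmp1.cmd",  # Intel Core i7 Extreme
    "SNB": "SNB_runmp1.cmd",  # Intel 2nd Generation Core
    "BNL": "BNL_runmp.cmd",   # Intel Atom
}


def mux_command(arch, seconds, tb5_name):
    """Return [script, duration, output file] for one multiplexed collection."""
    if arch not in MUX_SCRIPTS:
        raise ValueError(f"no multiplexing script listed for {arch!r}")
    if seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    if not tb5_name.endswith(".tb5"):
        raise ValueError("output file should be a .tb5 file")
    return [MUX_SCRIPTS[arch], str(seconds), tb5_name]


print(" ".join(mux_command("NHM", 60, "abc.tb5")))  # NHM_runmp1.cmd 60 abc.tb5
```

The returned list can be passed to a process launcher from an administrator command prompt once the Sampling Collector environment has been set.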
Single Collection Mode:
In single collection mode, we collect the basic events: clockticks and instructions retired. To collect
basic events, run sep –start in the <ptu>/bin directory and later analyze using xiflauncher.
Figure 5: steps to analyze data
Analyze data using ‘scripts’ in Intel® PBA
Intel® PBA provides sample scripts which can be used to run the most typical types of analyses
at various granularities. The scripts can be found in the PBA_ROOT\scripts folder and they are
detailed below.
Full Analysis
Script Name: full_analysis.cmd
This will run full issue analysis along with streams, spikes and loops in multiplexing and
non-multiplexing mode. This is the most common type of analysis run in Intel® PBA.
Usage
full_analysis.cmd <module_name> <tb5_dir_name> <SNB\MRM\PNR\NHM\BNL>
Example: full_analysis.cmd abc.exe myTb5Dir SNB
This command will generate a “db” directory which contains a database of output which is read
by the Intel® PBA GUI to view the results.
To open .xif files, copy the run_gui.cmd file from <PBA>\scripts\ and run it.
Example: run_gui.cmd
Go to “File->Open” and select the results.xif file. Please refer to the “Intel® Performance Bottleneck
Analyzer Report GUI” section for additional information on how to use the GUI to navigate
through the data.
Compare Architectures with Full Analysis
Script Name: compare_full_analysis.cmd
This will run comparison analysis between two architectures including issue analysis along with
streams, loops, spikes.
Usage
compare_full_analysis.cmd <module_name> <tb5_dir_name_arch1> <tb5_dir_name_arch2>
<SNB\MRM\PNR\NHM\BNL> <SANDYBRIDGE\MEROM\PENRYN\CORE™ I7\BONNELL>
Example: compare_full_analysis.cmd abc.exe myTb5Dir myTb5Dir2 SNB CORE™ I7
The first architecture parameter (and first tb5 directory name) is for the primary architecture
under evaluation. All of the issues identified will be for the primary architecture. The second
architecture parameter is for the competitive architecture. This can be useful to compare
against data analyzing the same binary but running on a different architecture or configuration
(e.g. Intel® Core™ i7 vs. Intel® 2nd Generation Core™ , SMT off vs. SMT on, different cache
sizes, etc.)
This analysis creates the same files as described above along with the secondary architecture
information populated as comparison to the primary architecture. Use the Viewer GUI to open
.xif files as described above and see the “Intel® Performance Bottleneck Analyzer Report GUI”
sub-section for additional information on how to view the results.
Additional Options of Interest
There are some additional options which can be added to the command line to enable different
features of Intel® PBA. Note that this is an advanced usage of Intel® PBA, so support for issues
encountered is limited. Below is an example of how to add an option to your analysis.
Usage example from full_analysis.cmd
Add an option (highlighted) to the command line.
java -Xmx1300m xIFJava.Main -f "%~1" -jdir "%~2" -%3 -csvoutput
Here are a few options which can be added/modified which may be useful.
1. -csvoutput
a. This option will output various csv files of data similar to what is displayed in the
GUI. An example of this is fuction_overview.csv which contains function
hotspots and counter data. Note that the preferred usage is to view data in the
GUI, but there are also some data in the csv’s that are currently unavailable in
the GUI. One example is multilineissues.csv file which contains the full assembly
for issues which span multiple Instruction Pointers such as store forward
blocked.
2. -Xmx<number of MB for Java heap>m
a. Note that this Java™ option is already set at 1300 in the sample scripts, but may
need to be increased for large datasets when additional memory is available.
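As an illustration of how these options combine, the sketch below assembles the same argument list as the sample script line shown earlier. It is a hypothetical helper, not something shipped with Intel® PBA: the function name and parameter names are assumptions, and mapping the script's three positional arguments to module, tb5 directory and architecture is inferred from the full_analysis.cmd usage line.

```python
def xif_java_args(module, tb5_dir, arch, heap_mb=1300, csv_output=False):
    """Build the java argument list used by the sample analysis script.

    heap_mb defaults to 1300, the value set in the sample scripts; raise it
    for large datasets when more memory is available.
    """
    args = ["java", f"-Xmx{heap_mb}m", "xIFJava.Main",
            "-f", module, "-jdir", tb5_dir, f"-{arch}"]
    if csv_output:
        args.append("-csvoutput")  # also emit csv files alongside the GUI data
    return args


print(" ".join(xif_java_args("abc.exe", "myTb5Dir", "SNB", csv_output=True)))
# java -Xmx1300m xIFJava.Main -f abc.exe -jdir myTb5Dir -SNB -csvoutput
```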
Introduction to the Framework
Intel® Performance Bottleneck Analyzer (aka xIF = x86/64 Issue Finder) was written to provide a
framework which can be used to identify and prioritize issues specific to current and future
Intel® architectures. The framework accomplishes this goal through the following capabilities:
Static assembly analysis
- An executable, shared library, or object file is disassembled and analyzed for known
issues in Intel® architecture
- Event based profiling must be present so issues found can be prioritized utilizing
clockticks and hit counts
- Assembly latencies are provided for the following architectures:
Intel® 2nd Generation Core™ , Intel® Atom™, Intel® Core™ i7, Intel® Core™ 2
Duo.
Correlation of event based sampling and assembly
- Utilize event based sampling to reconstruct instruction stream of execution as it ran
on the CPU in nanosecond time
- Correlate issues found with events with knowledge of the surrounding assembly at
any granularity
- Embeds any events collected alongside execution profile
Comparison between architectures
- Architectures covered:
Intel® 2nd Generation Core™ , Intel® Atom™, Core™ i7, Intel® Core™ 2 Duo
- Viewing data at granularities of module, function, instruction stream, and basic
block
- View execution profile of various architectures with events and assembly latencies
embedded
Below is a diagram of the basic flow of analysis
Intel® Performance Bottleneck Analyzer Granularity Definitions
Basic Block
Intel® Performance Bottleneck Analyzer determines where the basic-block boundaries lie
based on where branches and entry points exist within the code, as well as on data from
the LBR (last branch records). The formal definition of a basic block is any assembly code that
will run continuously without branching. All instructions within a basic block will have the same
execution frequency. Analysis at the basic block level negates some of the effects of skid
and allows us to look at reproducible data on an application. The tool also looks at all calls and
indirect branches to identify entry points. Each entry point is treated as the start of a new basic
block as well.
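The boundary rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the instruction tuples, the `BRANCH_MNEMONICS` subset and the function `split_basic_blocks` are hypothetical, and LBR data is not modeled. A block starts at the first instruction, at any known entry point or branch target, and at the instruction that follows a branch.

```python
BRANCH_MNEMONICS = {"je", "js", "jmp", "call", "ret"}  # illustrative subset only


def split_basic_blocks(instructions, entry_points=()):
    """Split (address, mnemonic, target_or_None) tuples, in address order,
    into basic blocks."""
    addresses = {addr for addr, _, _ in instructions}
    leaders = set(entry_points) & addresses
    if instructions:
        leaders.add(instructions[0][0])
    for i, (addr, mnemonic, target) in enumerate(instructions):
        if mnemonic in BRANCH_MNEMONICS:
            if target in addresses:
                leaders.add(target)                  # branch target starts a block
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])  # so does the fall-through
    blocks, current = [], []
    for ins in instructions:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks


code = [
    (0x10, "mov", None),
    (0x12, "and", None),
    (0x15, "je", 0x1C),
    (0x17, "mov", None),
    (0x19, "neg", None),
    (0x1C, "lea", None),
    (0x1F, "ret", None),
]
blocks = split_basic_blocks(code)
print([[hex(a) for a, _, _ in b] for b in blocks])
```

In this made-up listing the conditional jump at 0x15 ends the first block, its fall-through at 0x17 starts the second, and its target at 0x1C starts the third.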
Basic Block Example
lea edx,ds:[300A2CB8h] //Clock% 0.01
neg ax //Clock% 0.4
mov bl,al //Clock% 0.2
and ebx,0Fh //Clock% 0.9
je 30C83658 //Clock% 1.3
Total_Block_Clock = 2.81% total process clockticks
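The block total in this example is simply the sum of the per-instruction clock percentages. A short check of the arithmetic, using the values from the listing above (the dictionary keys are just labels for this sketch):

```python
# Clock% per instruction, copied from the basic block example.
clock_pct = {"lea": 0.01, "neg": 0.4, "mov": 0.2, "and": 0.9, "je": 1.3}

# Round to two decimals to hide floating-point noise in the sum.
total_block_clock = round(sum(clock_pct.values()), 2)
print(total_block_clock)  # 2.81
```

Because every instruction in a basic block executes equally often, the total is a more stable number to compare between runs than any single instruction's share, which skid can shift to a neighboring instruction.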
Spike Explanation
The spike object is any place in the profile where the tool sees a push-out in the retirement of
instructions, signifying a bottleneck in the pipeline. We use a simple heuristic to determine
where spikes occur in our streams and mark them so we can attempt to explain the bottleneck
with events and static assembly analysis. The spike object undergoes more analysis than other
portions of the application. Spikes usually occur on the instruction after the one where the bottleneck occurs,
because sampling typically attributes the cost of a major bottleneck to IP+1 (IP = Instruction Pointer). The
example below shows three spikes identified in a stream.
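The guide does not specify the heuristic, so the following is only a hypothetical stand-in to make the idea concrete: an instruction is flagged as a spike when its clock share is several times the median of its stream, and the preceding instruction is reported as the likely cause to reflect the IP+1 attribution. The function name, the `factor` threshold and the sample values are all assumptions.

```python
from statistics import median


def find_spikes(stream, factor=4.0):
    """stream: list of (instruction_text, clock_pct) in retirement order."""
    baseline = median(pct for _, pct in stream)
    spikes = []
    for i, (text, pct) in enumerate(stream):
        if baseline > 0 and pct >= factor * baseline:
            # Samples tend to land on IP+1, so suspect the previous instruction.
            suspect = stream[i - 1][0] if i > 0 else None
            spikes.append({"sampled_at": text,
                           "likely_cause": suspect,
                           "clock_pct": pct})
    return spikes


stream = [
    ("mov ecx,dword ptr [edi]", 0.2),
    ("sar ebx,18h", 0.3),
    ("mov esi,eax", 2.4),
    ("js 30AA64D8", 0.2),
    ("movzx eax,word ptr [edi+4]", 0.3),
]
for spike in find_spikes(stream):
    print(spike["sampled_at"], "<-", spike["likely_cause"])
```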
Stream and Loops Explanation
The streams granularity represents common paths of execution in the application. Streams are
created by using LBR (last branch record) data and basic block hit counts to recreate the
most common paths of execution on the chip. In the absence of LBR data the tool will still create the
streams granularity, but it has to make some risky assumptions about the path of execution. A
loop is just a specialized stream that starts at the head of the loop and ends at the tail of the
loop. The streams granularity is fundamental to studies because it accomplishes 3 goals:
1) Gives the performance engineer the context of the performance issue
2) Allows the engine to analyze a much larger window of generated code
3) Helps account for skid
Example of a flow of execution of a stream:
30aa668a: E8 68 call 30aa63f7 //Jump to address 30aa63f7
30aa63f7: 8B 0F mov ecx,dword ptr [edi]
30aa63ff: C1 FB 18 sar ebx,18h
30aa6403: 8B F0 mov esi,eax
30aa6405: 0F 88 CD js 30AA64D8 //Fall thru to next IP
30aa640b: 0F B7 47 04 movzx eax,word ptr [edi+4]
30aa640f: 66 3D FE FF cmp ax,0FFFEh
30aa6413: 74 66 je 30AA647B //Jump to address 30aa647b
30aa647b: 0F B7 movzx eax,word ptr [edi+4]
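One way to picture how branch records lead to a stream like the one above is a greedy walk that always follows the most frequently observed successor. This is a simplified sketch, not the algorithm Intel® PBA uses: the block labels, the `(from, to)` pair format and the function `most_common_path` are hypothetical, and basic block hit counts are ignored.

```python
from collections import Counter, defaultdict


def most_common_path(branch_pairs, start, max_len=16):
    """branch_pairs: iterable of (from_block, to_block) transitions, such as
    could be derived from LBR-like samples. Returns the greedy hottest path."""
    edge_counts = defaultdict(Counter)
    for src, dst in branch_pairs:
        edge_counts[src][dst] += 1
    path, seen = [start], {start}
    while len(path) < max_len and edge_counts[path[-1]]:
        nxt, _ = edge_counts[path[-1]].most_common(1)[0]
        if nxt in seen:  # stop when the path closes into a loop
            break
        path.append(nxt)
        seen.add(nxt)
    return path


pairs = ([("A", "B")] * 5 + [("A", "C")] * 2 +
         [("B", "D")] * 5 + [("D", "A")] * 4)
print(most_common_path(pairs, "A"))  # ['A', 'B', 'D']
```

Here the walk stops at D because its hottest successor is A, which is already on the path; in the guide's terms that closed path would be a loop with A as its head and D as its tail.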
HyperBlock Explanation
A HyperBlock is a set of instructions presented in retirement order similar to a stream. In a
HyperBlock no other path of execution significantly jumps out of the middle of a HyperBlock or
into the middle of the HyperBlock. All instructions within a HyperBlock will have the same
execution frequency. The stream granularity explained above is essentially a chain of
HyperBlocks in the most typical order that they were executed. The HyperBlock granularity was
created for the following reasons:
1) Code frequency statistics are more accurate since a typical HyperBlock is much
larger than a basic block.
2) Branch statistics on every basic block were too much information, so the
HyperBlock granularity helps make them readable.
3) Feeding paths of execution that run at the same frequency was necessary to
interoperate with assembly analyzers such as Intel® Architecture Code Analyzer
(IACA).
Intel® Performance Bottleneck Analyzer Capabilities
Intel® PBA is designed, written and maintained by performance engineers who are actively
engaged with software vendors. Every feature written for Intel® PBA has been used to study
and identify performance opportunities for our ISV customers. The framework was designed to
utilize knowledge of processor events and static assembly analysis to automatically explain
performance bottlenecks. The bottlenecks which cannot be explained are prioritized and tagged
for further analysis. The team has been analyzing our own applications with the Intel® PBA
framework on Intel® 2nd Generation Core™ and Intel® Atom™ for about 1 year. The toolset is now
aware of roughly 100 events on the Intel® 2nd Generation Core™ architecture and can associate
those events with patterns it finds with its static assembly analysis.
Intel® PBA has the following new components:
Intel® Performance Bottleneck Analyzer Report GUI
The GUI has been designed by Seung-Woo and Erik. Intel® PBA has a GUI to look at the analysis
at many different granularities and link them to issues and other insights that the tool provides.
Intel® PBA outputs a set of directories containing the database for the analysis. The root
directory for the output is always the “db” directory where xIFJava was run (Intel® PBA
analyzes one module at a time). You can run analysis on several modules in multiple runs. All the
analyses will be merged and stored in the “db” directory.
GUI Overview
0. File->Open: Opens the results of the analysis.
   a. The result of analysis is saved to results.xif and “db” directory in the same folder.
   b. When multiple analyses are performed on multiple modules, the results are merged in the “db” directory.
   c. Opening results.xif will load all the merged results of the analyses.
1. Options: sets the default behavior of GUI.
   a. Percent View: toggles percent vs. event count in GUI.
   b. Stream / Loop View: shows either streams or loops.
   c. Domain (x) axis: changes x axis in the chart.
2. Chart: Shows the data in the chart form. The chart currently showing is determined by “Selection Granularity” in (5).
3. Events: Shows the list of events to display.
   a. TB5: raw Sampling Collector events.
   b. Costs In Domain: Ratios defined in the template.
   c. Observation: High level stats.
4. Module Summary: various stats for the selected module. Chart and Selection Granularity table shows the data from stream / loop’s perspective. The tables here show the collection of different objects and granularities grouped by the same kind.
   a. Spikes: shows the list of all spikes for the selected module. Moving the cursor on the “Explained%”, “Load BreakDown”, or “LFB Breakdown” shows further information in the tooltip.
   b. Right clicking on any row in the tables pops up the streams / loops containing the selected object.
   c. Other entities show their own collections. The information varies depending on the objects and the events selected.
5. Selection Granularity: selects the granularity of the objects to analyze. They are arranged in
hierarchical order – Module contains Stream / Loop, Stream / Loop contains HyperBlock,
HyperBlock contains Basic Block, etc. This also controls what is shown in the chart area in (2).
a. Module: Shows the list of modules and their associated data selected in the events
window in (3).
i. In the above figure, only the clockticks is selected for Intel® 2nd Generation
Core architecture in (3).
ii. xxx.exe is consuming 99.16% of all clockticks.
iii. Selecting xxx.exe in the chart automatically selects the same in the module
table and vice versa.
iv. Double clicking the bar in the chart (xxx.exe) will drill down to stream
selection view for the module. Or you can select “xxx.exe” and click on
“Stream” tab in the selection granularity window.
b. Stream: Shows the list of streams and their associated data selected in the events
window in (3).
i. Selection on the chart and table works the same way as in module.
ii. Double clicking a stream in the chart, however, will switch the view to the instruction
line level.
iii. The HyperBlock, BasicBlock, and InstructionLine granularities show
the same graph in the chart, but the selection granularity in the chart
differs. For example, there are 6 lines in the corresponding
hyperblock.
c. HyperBlock / Basic Block / Instruction Line: displays the sequence of instructions
along with the associated data selected in the events window (3).
i. In the upper right corner of the chart, the overall issues found in the
currently selected stream are displayed along with the impact of the
individual issues to the stream.
ii. The big circles in the chart are the spikes. Small circles around the spike
show the specific type of information for the spike. Clicking on the big circle
cycles through the information on the label. You can drag the label around
for better viewing.
Latest Architecture Support
The tool supports the latest Intel® architectures such as Intel® 2nd Generation Core™
and Intel® Atom™. We have been working directly with architecture teams to add
support for architecture-specific bottlenecks. See the Intel® Atom™ and Intel® 2nd Generation
Core™ studies section below for more details.
LBR Support
Charlie Hewett has tackled grabbing full LBR (last branch record) data. This functionality results
in more accurate reproductions of the path of execution, accurate hit counts on basic blocks and
the capability to output branch statistics. This also helps us create our new base granularity named
HyperBlocks. In the near future we will use this to produce a statistical call graph. LBR support
is only available on the Intel® Core™ i7, Intel® 2nd Generation Core™ and Intel® Atom™
architectures. Precise instructions retired (PDIR) is also collected to assist with basic block hit
counts on the Intel® 2nd Generation Core™ architecture.
TB5 Database Backend
Rajshree, Joe and Erik worked directly with the Intel® VTune™ Amplifier XE team
to integrate their backend database, creating a data access layer that feeds counter data to the tool
engine.
Linux and Mac Parsing Support on Windows
Joe has spent a lot of time ensuring that Linux and Mac analysis work. Sampling Collector data
collected on either a Linux* or Mac* OS X system can be moved to a system running Intel®
Performance Bottleneck Analyzer and analyzed as usual. Simply copy the TB5 data set, along
with the necessary binaries (with or without symbols stripped), and follow the steps outlined in
this document.
For OS X*, the Intel® Performance Bottleneck Analyzer only recognizes the Mach-O binary file,
rather than the .app package. The binary for such a package can often be found inside under
MyApp.app/Contents/MacOS. In addition, Sampling Collector on OS X is still under
development, and may not be immediately available at the time of this Intel® Performance
Bottleneck Analyzer release.
The current release of Intel® Performance Bottleneck Analyzer does not support running on
Linux* or OS X*. Please contact the Intel® PBA development team for more
information and updates.
Load Latency
The load latency event is now used to break down the dreaded LFB (line fill buffer) source, which
can incur any latency. This feature, implemented by Peter and Rajshree, has also helped us
determine whether Intel® AVX loads are missing in the L1D or not. Load latency is only available
on Intel® Core™ i7, Intel® Core™ i7 Extreme and Intel® 2nd Generation Core™ architectures.
Load Latency data provides additional information on load cost in two cases where the precise
load events are ambiguous:
- There is a known issue with the precise events and Intel® Advanced Vector Extensions
  (Intel® AVX) 256 bit loads. For these loads the precise events data will look as if loads
  are always satisfied from L1D or Line Fill Buffer (LFB). The tool can use load latency data
  to substitute a more accurate breakdown of probable sources based on load latency.
- For all loads from LFB, the actual cost can be highly variable. The tool can use load
  latency data to provide a supplemental breakdown of likely sources of data loaded from
  LFB, to provide a better picture of the overall cost of a load with a large share of LFB
  samples.
Note: Because percentages are truncated to 1 decimal place in the GUI, the load breakdown may
show 0.0% coming from LFB while an LFB breakdown is still shown when the load
latency counter is collected.
Multiplexing Support
The tool is moving toward accomplishing a full analysis from a single run. The command lines and shell
scripts for a multiplexed run on each architecture are included in our scripts folder.
Multiplexing support has been thoroughly tested on non-steady state workloads thanks to some
great engineering work by Manuj Sabharwal. Several post-processing checks have been
implemented to ensure that the multiplexing data is representative of the entire run. The only
drawback today is that LBR data, load latency data and Precise Instructions Retired (PDIR)
cannot be collected with multiplexing and need to be collected in separate runs. The LBR
and load latency runs are not required for analysis but we recommend running them.
Slow Frames Analysis
Slow framerate profiling was created to compare data from slow frames with data from fast
frames and determine everything that is different between the two data sets.
This is accomplished by instrumenting the application binary with calls to record the time stamp
counter (tsc) before the frame starts and after it finishes. The delta is then checked
against user input and recorded if the value is greater than the time required to hit the
minimum framerate specified.
The start and end times of the slow frames are written to a csv file, which is read by the tool in
order to perform the analysis between the two data sets.
A follow-up blog will be published explaining usage of this feature in detail.
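The slow-frame test described above amounts to comparing each frame's time stamp counter delta with the tick budget implied by the minimum framerate. The sketch below illustrates that comparison only. The function name, the `(start_tsc, end_tsc)` pair format and the assumption of a constant TSC frequency (`tsc_hz`) are hypothetical, and the csv layout the tool expects is not shown because this guide does not describe it.

```python
def slow_frames(frames, tsc_hz, min_fps):
    """Return the (start_tsc, end_tsc) pairs that took longer than 1/min_fps.

    frames  : list of (start_tsc, end_tsc) tick counts, one pair per frame
    tsc_hz  : time stamp counter ticks per second (assumed constant)
    min_fps : minimum acceptable framerate supplied by the user
    """
    budget_ticks = tsc_hz / min_fps  # ticks available per frame at min_fps
    return [(start, end) for start, end in frames if (end - start) > budget_ticks]


frames = [
    (0, 50_000_000),            # 50M ticks: within budget
    (50_000_000, 200_000_000),  # 150M ticks: too slow
    (200_000_000, 290_000_000), # 90M ticks: within budget
]
# At 3 GHz and a 30 fps minimum, the budget is 100M ticks per frame.
print(slow_frames(frames, tsc_hz=3_000_000_000, min_fps=30))
# [(50000000, 200000000)]
```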
Power Correlation Capability
Intel® Performance Bottleneck Analyzer provides a way to correlate CPU power data from
NetDAQ* analysis (CSV file created by NetDAQ*) with the performance data collected by the
Sampling Collector. Power and performance data are collected at the same time for any workload.
The correlation is then achieved by using time stamp events from the NetDAQ* data and the
Sampling Collector tb5 files. Using the CPU power data from NetDAQ*, the tool creates 2 bins of
data for comparison. The high power bin is set up at the upper 20% of the power limit from the CPU power data. The rest
becomes the low power bin. Intel® PBA then compares these power bins to identify hot modules
and functions which are most active in the high power area as compared to the low power area.
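The binning step can be sketched as follows. This is one possible reading of "upper 20%", namely samples within 20% of the peak observed CPU power, and it is an assumption rather than the documented rule. The function name, the `(timestamp, watts)` sample format and the example values are hypothetical.

```python
def split_power_bins(samples, high_fraction=0.20):
    """Split (timestamp, watts) samples into (high_bin, low_bin).

    A sample is 'high power' when it lies within high_fraction of the peak
    observed power; everything else is 'low power'.
    """
    peak = max(watts for _, watts in samples)
    threshold = peak * (1.0 - high_fraction)
    high = [s for s in samples if s[1] >= threshold]
    low = [s for s in samples if s[1] < threshold]
    return high, low


samples = [(0, 10.0), (1, 35.0), (2, 42.0), (3, 50.0), (4, 20.0)]
high, low = split_power_bins(samples)
print([t for t, _ in high])  # [2, 3]
print([t for t, _ in low])   # [0, 1, 4]
```

The timestamps in each bin would then be matched against the sampling data so that module and function activity can be compared between the two bins.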
Here is what we get in module view for an ISV workload
The data indicates that at module level, xxx.dll spends more time in high power bin.
Further drilling down to function level indicated that a spin wait loop is more active in high
power bin. Discussion with ISV indicated that __pause was not used in the spin wait loop.
Thanks to Rajshree, George and Jun for adding this functionality.
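The binning and correlation steps can be sketched roughly as below. This is a sketch under stated assumptions, not the tool's implementation: "upper 20% of power limit" is read here as "within 20% of the peak observed power", each performance sample is attributed to the most recent power reading, both data sets are assumed to share one time base, and all names are hypothetical.

```python
import bisect
from collections import Counter


def module_counts_by_power_bin(power_samples, perf_samples, high_fraction=0.20):
    """Count performance samples per module in the high and low power bins.

    power_samples: list of (timestamp, watts) readings.
    perf_samples:  list of (timestamp, module_name) samples.
    """
    power_samples = sorted(power_samples)
    times = [t for t, _ in power_samples]
    peak = max(watts for _, watts in power_samples)
    threshold = peak * (1.0 - high_fraction)  # assumed meaning of "upper 20%"

    counts = {"high": Counter(), "low": Counter()}
    for t, module in perf_samples:
        i = bisect.bisect_right(times, t) - 1  # latest power reading at or before t
        if i < 0:
            continue  # sample precedes the first power reading
        bin_name = "high" if power_samples[i][1] >= threshold else "low"
        counts[bin_name][module] += 1
    return counts


power = [(0, 10.0), (10, 45.0), (20, 50.0), (30, 12.0)]
perf = [(1, "app.exe"), (11, "xxx.dll"), (15, "xxx.dll"),
        (22, "xxx.dll"), (25, "app.exe"), (31, "app.exe")]
counts = module_counts_by_power_bin(power, perf)
```

With this made-up data, xxx.dll accounts for three of the four samples in the high power bin and none in the low power bin, which is the kind of skew the module view is meant to expose.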
Templates
Templates provide a way to customize the issues and ratios that the user wants to see at run-time.
Templates also provide a way to override the cost of any architectural issue. If the costs are not
overridden in templates, the default costs per architecture are obtained from the arch layer. These
are populated per architecture in a CSV file stored in the PBA_ROOT\templates folder.
For Intel® 2nd Generation Core™ architecture, the template file includes ratios called
‘OBSERVATIONS’. Observations are a way to determine which stage of the pipeline the code is
bottlenecked on. We have 4 high level observations: FrontEnd (with a further breakdown of how
many uops are delivered by the FrontEnd per cycle), BackEnd, BadSpeculation (mispredicted uops)
and Retiring. We have also included sub-categories inside these top 4 categories to identify
the stages of the pipeline that are bottlenecks. Observations don’t have any cost associated with
them since they are not actual issues, but indications of issues. The percentages associated with
observation factors provide a high level indication of the primary bottleneck at each
object granularity, which can be used to zoom in further on issues.
We are currently working on breaking those higher observations down further to zoom into
each part of the pipeline. Currently these are not in any hierarchical order; this will be included
in future releases.
How to add an issue to templates
1. Open the required architecture’s template CSV file.
2. Add the issue name, rule type (e.g. dynamic) and issue description.
3. Add numerator and denominator events that would be used to calculate ratios for dynamic
issues in the (). For these events, we have support for event math such as
(eventA-eventB+eventC).
4. MulFactor is the static cost for the issue on the selected architecture. E.g. the cost of LLC_MISS
on Intel® Core™ i7 is typically ~200 clocks (which is stored in the arch layer by default). But if
your calculated cost run shows LLC_MISS as 180 clocks instead, you can change the
MulFactor in the templates CSV file, which will automatically override the default cost. This
helps customize the issue impact per application.
Currently ratio_high and ratio_low have the same value. They indicate the threshold to
check for in your application before tagging an issue to it. E.g. if
LLC_MISS_CounterName/Clockticks*200 > 0.05, then we have an LLC_MISS issue.
5. The Non-Supported Objects column provides a way to exclude any object granularity from
the specified rule. E.g. if we want to apply a rule only for loop level analysis, we can
exclude other object types such as streams, blocks, modules and functions from that rule by
adding the object types here. The valid object types are: ModuleData, FunctionData,
LoopData, StreamData, SpikeData, BlockData.
6. PriorityFactor indicates the priority of the issue within the hierarchy that we maintain. This
is based on how much the issue would typically cost if it is found. We use the priority factor to
give appropriate weight to multiple issues found at an object granularity, so that we
account for issues like LLC_MISS before putting any weight on front end issues.
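The threshold check from step 4 can be illustrated with a short sketch. It is not the tool's parser: the function names are hypothetical, the event math handled here is limited to + and - terms as in the example above, and the counter values are made up.

```python
import re


def eval_event_math(expr, counts):
    """Evaluate a +/- event expression such as '(eventA-eventB+eventC)'."""
    total = 0
    terms = re.findall(r"([+-]?)\s*([A-Za-z_][\w.]*)", expr.strip("() "))
    for sign, name in terms:
        total += -counts[name] if sign == "-" else counts[name]
    return total


def issue_impact(numerator, denominator, counts, mul_factor):
    """Estimated fraction of the denominator (e.g. clockticks) lost to the issue."""
    return (eval_event_math(numerator, counts) * mul_factor
            / eval_event_math(denominator, counts))


def has_issue(numerator, denominator, counts, mul_factor, ratio_threshold=0.05):
    """Tag the issue when the impact exceeds the ratio threshold."""
    return issue_impact(numerator, denominator, counts, mul_factor) > ratio_threshold


counts = {"LLC_MISS": 1000, "Clockticks": 2_000_000}
default_impact = issue_impact("LLC_MISS", "Clockticks", counts, mul_factor=200)
tuned_impact = issue_impact("LLC_MISS", "Clockticks", counts, mul_factor=180)
flagged = has_issue("LLC_MISS", "Clockticks", counts, mul_factor=200)
```

With these numbers the default cost of 200 clocks gives an impact of 0.10, and overriding MulFactor to 180 lowers it to 0.09. Both are above the 0.05 threshold, so the issue would be tagged either way.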
Top Down Counter Analysis Using “Observations”
We have added initial capability to perform top down counter analysis on Intel® 2nd Generation
Core™ by enabling “Observations”. Observations help to analyze CPU execution at a high level
and then drill down in a structured manner to identify the true bottleneck(s). Charlie Hewett
has been working with Ahmad Yasin to get the ratios defined in our templates file.
Known Issues
1) Intel® Performance Bottleneck Analyzer only supports client workloads and processors.
Analysis can be attempted for HPC or server workloads and cores but is not supported.
2) Analysis of managed code (e.g. Java or C#) with Intel® Performance Bottleneck Analyzer is
not supported at this time.
3) Analysis of the Linux kernel from Intel® PBA cannot be fully trusted. The toolset has a
known issue of dropping samples from the kernel analysis, which will be fixed in the next
revision of the tool. The tool will output a warning after analysis indicating that samples
have been dropped.
4) If you overwrite an old tb5 file with a newly collected tb5 file of the same name, make sure to delete
the temp_data folder, since we only check tb5 file names at the moment and not time/date
stamps.
5) Statically found issues may get a higher cost based on hit count.
6) Load latency data is only available at spike level, for the instruction before the pinnacle of the spike.
7) LBR data may have invalid addresses on certain platforms and configurations. We have
been debugging what we believe to be a firmware issue on single socket Intel® Core™ i7
Extreme based platforms where the addresses returned are bogus. We have implemented a
check in Intel® PBA, and if the addresses are invalid, you will see the
following message in the console output and Intel® PBA execution will continue without LBR
data:
WARNING: Total number of taken branch (i.e. usable) Lbr
samples was: 0
Execution will continue, but LBR analysis will not be
available.
If you see this issue, you may need to collect data on Intel® Core™ i7 or Intel® 2nd
Generation Core™ or Intel® Atom™ to get valid LBR data.
8) The GUI only displays the top 20 streams, loops and modules in the bar graph, but the table below
it contains the entire list.
9) For Intel® 2nd Generation Core™ observations, events haven’t been fully validated in the
SMT case.
10) When running comparison analysis on a full dataset (LBR, load latency) on two architectures,
it is strongly recommended to run on a system with 4GB of memory and increase the Java
heap size to 3GB instead of the default 1.3GB by editing the compare_full_analysis.cmd file. A
crash is likely to happen with the 1.3GB heap option.
11) For Windows* XP only, please install the redistributables below:
1. Microsoft Visual Studio* 2005 SP1 redistributables
2. Microsoft Visual Studio* 2008 redistributables
12) When resolving or copying the binaries and/or symbols, Intel® PBA Launcher can give the error: “Module does not exist”. A possible cause is Windows user access control blocking the access.
13) On certain Intel® Core i7 Extreme processors, load breakdown may not give correct information when the data is collected using Intel® PTU v 4 update 5, as one of the events may be missing. It is recommended to use manual collection or collection via the text based utility in this case.
14) Analyzing data with xiflauncher may give the error: “tb5 directory not found”. In that case, create a tb5 folder inside the working directory and copy the tb5 files into it. Xiflauncher handles strict case of run on Windows* and is just a learning mechanism.
Papers/blogs on Intel® PBA Support for Latest Architectures
Intel® Performance Bottleneck Analyzer has added additional support for the Intel® Atom™
processor and 2nd Generation Core™ architecture analysis.
See blogs written on short call-ret finder and zero length call finder for Intel® Atom™
architecture at
http://software.intel.com/en-us/blogs/2010/10/25/zero-length-calls-can-tank-atom-processor-performance/
http://software.intel.com/en-us/blogs/2010/10/12/avoid-short-functions-on-atom/
For 2nd Generation Core™ processor support case studies, see optimization guide appendix B
(using performance monitoring events – sub-section 3)
http://www.intel.com/Assets/PDF/manual/248966.pdf
Load breakdown using precise load retired events is described in the blog at
http://origin-software.intel.com/en-us/blogs/2010/09/30/utilizing-performance-monitoring-events-to-find-problematic-loads-due-to-latency-in-the-memory-hierarchy/
Using load latency to estimate line fill buffer breakdown is described at
http://software.intel.com/en-us/blogs/2010/11/11/utilizing-load-latency-event-in-performance-monitoring-to-get-line-fill-buffer-breakdown/
Meet the Intel® Performance Bottleneck Analyzer Design Team
Rajshree Chabukswar
Architect
Templates
Event data
Issue finders
Module/function/thread granularities
Intel® VTune™ Amplifier XE backend integration
Load latency
Power analysis
Mike Chynoweth
Architect
Line/Hyperblock/Block/Stream/Loop/Spike granularities
Issue finders
Issue object layer
Intel® 2nd Generation Core™ support
Intel® Atom™ Support
Jun De Vega
Power analysis
Issue finders
Eli Hernandez
Issue Finders
Charlie Hewett
Command line
LBR infrastructure
Architecture layer
Observations
Seung-Woo Kim
Issue Finders
Intel® PBA Reporting GUI
GUI database
Petter Larsson
Intel® Atom™ support
Issue Finders
George Lin
Issue Finders
Power analysis
Lynn Merrill
Intel® Atom™ support
Issue Finders
Erik Niemeyer
Architect
Architecture layer
Data access layer/Dicer
Issue finders
XED Disassembler layer
Logging layer
Intel® PBA Reporting GUI
Database layer
Source control
Intel® Atom™ Support
Text-based Launcher
Peter Nee
Load latency
Intel® AVX load support
Intel® 2nd Generation Core™ finders
SIMD partial register stall finder
Joe Olivas
Linux*
Mac* OS X
Intel® IACA support
Intel® VTune™ Amplifier XE backend integration
Chris Phlipot
Competitive analysis
Intel® PBA Reporting GUI
Bucketing layer
Intel® PBA production support
Manuj Sabharwal
Multiplexing support
PTU Integration
Scripts
Intel® Core™ i7 support
Vladimir Tsymbal
IP2SYM Symbol resolution
Acknowledgements
Many thanks to the Sampling Collector, Intel® VTune™ Amplifier XE backend integration and
Intel® IACA development teams, who helped resolve issues when integrating with Intel®
Performance Bottleneck Analyzer.
Sampling Collector – Shobha Ranganathan, Vishnu Naikawadi, Bhanu Shankar
Intel® VTune™ Amplifier XE Backend – Tony Mongkolsmai, Alexei Alexandrov, Anna Malashkina,
Lee Baugh, Douglas Armstrong, Anton Yefimov
Intel® PTU – Julia Fedorova, David Levinthal, Dmitry Bazhin, Alexey Bukhnin, Anastasya
Vladimirova, Iliya Grachev
Intel® Architecture team – Ahmad Yasin
Intel® IACA – Israel Hirsh, Tal Uliel
Bugs
Please submit bugs to the whatif site on PBA.