Performance Counters on Intel® Core™ 2 Duo Xeon® Processors
Intel® Software College
Copyright © 2007, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objective
At the successful completion of this module, you will be able to
• Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors
• Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors
Agenda
Core® micro-architecture review
Event basics
Events identifying Intel® Core™ 2 Duo Xeon® processors bottlenecks
Summary
Next Generation Micro Architecture: Intel® Core™ 2 Duo Processor
[Block diagram: two cores (CPU-0 and CPU-1), each with a 32 KB L1 data cache, a 32 KB L1 instruction cache, an L0/L1 DTLB, and a PMH; both cores share a 4 MB L2 cache connected to the FSB.]
Architecture Block and Instruction Flow
Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.
[Block diagram, by pipeline stage:]
• Fetch / Decode: Branch Target Buffer, Next IP, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)
• Execute: Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports; execution ports feeding FP Add, FP Div/Mul, SIMD, SIMD Integer Arithmetic, Integer Shift/Rotate, Integer Arithmetic, Load, Store Addr, and Store Data units; Memory Order Buffer (MOB); 32 KB Data Cache
• Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set
• Bus Unit: to L2 Cache/Memory
VTune™ Analyzer Event Basics
Events Versus Samples
A performance counter increments on the CPU every time an event occurs
A sample of the execution context is recorded every time a performance counter overflows
Events = samples * sample after value
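As a minimal sketch of the relation above, the Python snippet below estimates an event count from a sample count. The function name and the numbers are hypothetical, chosen only for illustration; they are not VTune™ analyzer output.

```python
def estimate_event_count(samples: int, sample_after_value: int) -> int:
    """Estimate total events as samples * sample after value."""
    return samples * sample_after_value


# Hypothetical run: 1,500 samples collected with a sample after value of 2,000,000.
estimated_events = estimate_event_count(1_500, 2_000_000)
print(f"Estimated events: {estimated_events:,}")
```

The estimate is only as fine-grained as the sample after value: events that occurred after the last counter overflow are not reflected in the sample count.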
VTune™ Analyzer Event Basics
Retired Versus Non-Retired Events
Retired events include only events that occur due to instructions that are committed to the machine state.
• For example, when measuring the Loads Retired event, a load that occurs on a mispredicted execution path is not counted
Most retired events can also be precise events.
• No event skid
VTune™ Analyzer Event Basics
Event Skid
Because of interrupt latency, an event can appear in the disassembly or source view a few lines after the instruction where it actually occurred.
VTune™ Analyzer Event Basics
Precise Events
Precise events:
• Do not suffer from event skid
• Use hardware to record the address where the event occurs
• Reduce the number of events you can collect at once
VTune™ Analyzer Event Basics
Precise Events (cont.)
[Figure: two screenshots labeled "On" and "Off"]
VTune™ Analyzer Event Basics
Event Ratios
Event ratios:
• Calculate common processor performance metrics
• Are built into the VTune™ analyzer
VTune™ Analyzer Event Basics
Clockticks and Instructions Retired
Clockticks measure CPU cycles
Clockticks/processor frequency = time in seconds
Instructions retired = the number of instructions committed to the processor state (executed completely)
Cycles per instruction (CPI) = clockticks / instructions retired
High CPI usually indicates opportunities for optimization.
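The two formulas on this slide can be sketched in a few lines of Python. The function names, the 3.0 GHz frequency, and the event counts below are assumptions made up for the example.

```python
def seconds_from_clockticks(clockticks: int, frequency_hz: float) -> float:
    """Clockticks / processor frequency = time in seconds."""
    return clockticks / frequency_hz


def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """CPI = clockticks / instructions retired."""
    if instructions_retired == 0:
        raise ValueError("instructions_retired must be non-zero")
    return clockticks / instructions_retired


# Hypothetical counts on a 3.0 GHz processor.
clockticks = 3_000_000_000
instructions_retired = 2_000_000_000
print(seconds_from_clockticks(clockticks, 3.0e9))
print(cycles_per_instruction(clockticks, instructions_retired))
```

Comparing CPI per function or per loop, rather than for the whole run, is usually what points to a candidate for optimization.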
VTune™ Analyzer Event Basics
Clockticks Versus Non-halted Clockticks
Clockticks = halted + non-halted cycles (but no sleep cycles)
• The clockticks event measures cycles when the physical processor is not in any sleep mode.
• The non-halted clockticks event measures the cycles in which a logical processor is neither asleep nor halted.
If you measure clockticks on a Hyper-Threading Technology-enabled system while running a single-threaded application, you will see a lot of samples around the halt instruction in processor.sys.
Agenda
Core® micro-architecture review
Event basics
Performance tuning for Intel® Core™ 2 Duo Xeon® processors
• Events for performance
• Performance optimization methodology
• X86 cycle accounting
Summary
Performance Events along µ-op Flow (1)
[Block diagram: the "Architecture Block and Instruction Flow" diagram repeated, covering Fetch / Decode, Execute, Retire, and the Bus Unit path to L2 Cache / Memory.]
Memory Access (Examples)
• Latencies
  • L1 miss hits L2 ~10 cycles
  • L2 miss, access to memory ~300 cycles (server/FBD)
  • L2 miss, access to memory ~165 cycles (Desk/DDR2)
• Cache Bandwidth
  • Bandwidth to cache ~8.5 bytes/cycle
• Memory Bandwidth
  • Desktop ~6 GB/sec/socket (linux*)
  • Server ~3.5 GB/sec/socket
Performance Events for the Front End
EVENT                           P   Description
CPU_CLK_UNHALTED
INST_RETIRED.ANY_P              P
INST_RETIRED.LOADS
INST_RETIRED.STORES
BUS_TRANS_ANY                       all bus transactions
BUS_TRANS_MEM                       bus trans to memory
BUS_TRANS_BURST                     whole $lines to mem
BUS_TRANS_BRD                       whole line reads from mem
BUS_TRANS_WB                        writebacks (no NT writes)
BUS_TRANS_RFO                       $lines in for RFO (no HW pref)
BUS_DRDY_CLOCKS.ALL_AGENTS          all busy bus cycles
BUS_DRDY_CLOCKS.THIS_AGENT          all busy bus cycles due to writes
MEM_LOAD_RETIRED.L2_LINE_MISS   P   L2 demand misses
MMX2_PRE_MISS.T1                    SW prefetch to L1 inst
MMX2_PRE_MISS.T2                    SW prefetch to L2 inst
MMX2_PRE_MISS.STORES                Non Temporal Stores executed
L2_LINES_IN.SELF.DEMAND             L2 $lines in for rfo, load, sw prefetch
L2_LINES_IN.SELF.PREFETCH           L2 $lines in for hw prefetch
L2_LINES_OUT.SELF.DEMAND            demanded L2 $lines evicted
L2_LINES_OUT.SELF.PREFETCH          HW prefetch L2 $lines evicted
Memory BW = 64*Bus_Trans_Mem*freq/Cpu_Clk_Unhalted
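The bandwidth formula above can be sketched as follows. The factor 64 is taken from the formula itself (the snippet assumes it is the bytes moved per bus transaction); the function name, the 3.0 GHz frequency, and the event counts are hypothetical.

```python
BYTES_PER_BUS_TRANSACTION = 64  # the constant 64 in the slide's formula


def memory_bandwidth_bytes_per_sec(bus_trans_mem: int,
                                   cpu_clk_unhalted: int,
                                   frequency_hz: float) -> float:
    """Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted."""
    if cpu_clk_unhalted == 0:
        raise ValueError("cpu_clk_unhalted must be non-zero")
    return BYTES_PER_BUS_TRANSACTION * bus_trans_mem * frequency_hz / cpu_clk_unhalted


# Hypothetical counts: 100 million memory transactions in 3 billion unhalted cycles at 3.0 GHz.
bandwidth = memory_bandwidth_bytes_per_sec(100_000_000, 3_000_000_000, 3.0e9)
print(f"{bandwidth / 1e9:.1f} GB/sec")
```

Dividing Cpu_Clk_Unhalted by the frequency gives elapsed time, so the expression is simply bytes transferred divided by seconds.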
Lab Activity 1: Calculating the Memory Access Bandwidth
In this lab, you will calculate memory bandwidth from performance counter events collected with the VTune™ analyzer.
Performance Events along µ-op Flow (2)
[Block diagram: the "Architecture Block and Instruction Flow" diagram repeated, with the annotation "Resource_Stalls measures here: transfer from Decode".]
Performance Events of Resource_Stalls
µ-op flow to OOO engine blocked by downstream cause
Resource_Stalls.BR_MISS_CLEAR
• Pipeline stalls due to flushing mispredicted branches
• Combine in Resource_stalls.CLEAR
• Mispredicted branch followed by fp inst
Resource_Stalls.ROB_FULL
• 96 instructions in ROB
Resource_Stalls.LD_ST
• All Store or Load buffers in use
Resource_Stalls.RS_FULL
• 32 instructions waiting for inputs in Reservation Station
Measuring Instruction Starvation
There really is no good way to do this
• Anti-correlate with Resource_stalls.RS_full
There could be
• Cycles Decode queue is empty
• Cycles RS is empty
• Cycles ROB is empty
Performance Events along µ-op Flow (3)
[Block diagram: the "Architecture Block and Instruction Flow" diagram repeated, with the annotations "Rs_uops_dispatched measures at Execution" and "Other stalls measures at Execution".]
Measuring Efficiency in the Execution Stage
The OOO engine optimizes instruction issue from the Reservation Station to the functional units
• µ-ops wait in the Reservation Station until their inputs are available
• RS_UOPS_DISPATCHED measures the number of µ-ops dispatched from the RS on each cycle
There are chains that prevent the OOO engine from executing in parallel
• Partial Register Stall
• Partial Flag Register Stall
• Domain bypass
• Others…
Performance Events along µ-op Flow (4)
[Block diagram: the "Architecture Block and Instruction Flow" diagram repeated, with the annotation "µ-ops retired measures at Retirement".]
Retirement vs Dispatch
Which counters to work on first?
• For loops, difference is due to OOO execution
• Fewer false positives when “Stalls” are measured at Dispatch
• Retirement is generally more important than Dispatch
Performance Optimization Methodology
This style of optimization has 2 components
• Minimizing instruction count (path length)
  • A sort of “tree height” minimization
• Minimizing deviations from ideal execution
  • Generically thought of as “stall cycles”
Treating both equally is critical
Stalls, Execution Imperfection and Performance Analysis
Stall cycles are used to indicate less than perfect execution
• An architectural decomposition of “stalls” can be used to guide the selection of architectural events
• The IP correlation of “stalls” and arch events then guides the optimization effort
Stalls have 4 basic components in x86
• Front End stalls
  • Execution stage instruction starvation (Front End)
• Mispredicted branch pipeline flushing
• Execution stalls
  • (Waiting on input/Scoreboard, L2 miss, BW, DTLB, glass jaws etc)
• Cycles wasted executing instructions that are not retired
X86 Cycle Accounting and SW Optimization
Cpu_clk_unhalted = “stalls” + dispatch = “stalls” + non_ret_dispatch + ret_dispatch
Traditional Stall Removal
Reduce Branch Mispredictions, PGO
Improve Optimization to Reduce Instruction Count, Split Loops to Increase ILP
Resource_stalls.br_miss_clear will estimate stalls due to Pipeline Flush
Cycle Accounting on X86
Cycles = “stalls” + dispatch
• An equality by definition
Cycles ~ CPU_CLK_UNHALTED.CORE
• For cpu intensive applications/sampling
Stall Cycles = Cycles with NO uops Dispatched = RS_UOPS_DISPATCHED.CYCLES_NONE
Dispatch Cycle = RS_UOPS_DISPATCHED
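A minimal sketch of the top-level split, assuming cycles are taken from CPU_CLK_UNHALTED.CORE and stall cycles from RS_UOPS_DISPATCHED.CYCLES_NONE, with dispatch cycles derived as the remainder from "Cycles = stalls + dispatch". The function name and the counts are hypothetical.

```python
def cycle_accounting(cpu_clk_unhalted_core: int,
                     rs_uops_dispatched_cycles_none: int) -> dict:
    """Split total cycles into stall and dispatch fractions."""
    if cpu_clk_unhalted_core == 0:
        raise ValueError("cpu_clk_unhalted_core must be non-zero")
    stall_cycles = rs_uops_dispatched_cycles_none
    # Cycles = stalls + dispatch, so dispatch cycles are whatever is left over.
    dispatch_cycles = cpu_clk_unhalted_core - stall_cycles
    return {
        "stall_fraction": stall_cycles / cpu_clk_unhalted_core,
        "dispatch_fraction": dispatch_cycles / cpu_clk_unhalted_core,
    }


# Hypothetical counts: 1,000 cycles, 400 of them with no µ-ops dispatched.
print(cycle_accounting(1_000, 400))
```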
Cycle Accounting on X86 (cont.)
Dispatch ~ cycles_dispatch_retiring_uops + cycles_dispatch_non_retiring_uops
• Assumes no overlap of retired/non retired uops
• Worst case scenario
Non retired uops = rs_uops_dispatched – (uops_retired.any + uops_retired.fused)
• Non retired uop cycles ~ non retired uops/avg_uops_per_cycle
Fractional Wasted Work = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
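The wasted-work formulas above can be sketched as shown below. The function names and event counts are hypothetical, and avg_uops_per_cycle is treated as a caller-supplied value because the slide does not say how it is obtained.

```python
def non_retired_uops(rs_uops_dispatched: int,
                     uops_retired_any: int,
                     uops_retired_fused: int) -> int:
    """Non retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)."""
    return rs_uops_dispatched - (uops_retired_any + uops_retired_fused)


def fractional_wasted_work(rs_uops_dispatched: int,
                           uops_retired_any: int,
                           uops_retired_fused: int) -> float:
    """Fractional wasted work = rs_uops_dispatched / (any + fused) - 1."""
    retired = uops_retired_any + uops_retired_fused
    if retired == 0:
        raise ValueError("retired uop count must be non-zero")
    return rs_uops_dispatched / retired - 1


def non_retired_uop_cycles(non_retired: int, avg_uops_per_cycle: float) -> float:
    """Non retired uop cycles ~ non retired uops / avg_uops_per_cycle."""
    return non_retired / avg_uops_per_cycle


# Hypothetical counts: 1,200 µ-ops dispatched, 900 + 100 retired.
wasted = non_retired_uops(1_200, 900, 100)
print(wasted)
print(fractional_wasted_work(1_200, 900, 100))
print(non_retired_uop_cycles(wasted, 2.0))
```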
Pulling Cycle Accounting Together
[Chart: "Cycle Accounting", a stacked bar showing Executing and Stalls as fractions of total cycles (axis from 0 to 1.2). Illustrative example only, not real data.]
Decomposing Stalls: Elephants First
Pipeline Flush = Resource_Stalls.Br_Miss_Clear / cycles
L2 Hits = (MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS) * 10 / cycles
DTLB/L2 Miss = event count * penalty / cycles
FE + Scoreboard = Stalls – all of the above
[Chart: "Stall Decomposition", two stacked bars (1 and 2) showing Executing, FE + Scoreboard, Pipeline Flush, DTLB, L2 Hits, L2 Misses, and Stall Total as fractions of total cycles (axis from 0 to 1.2). Illustrative example only, not real data.]
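The decomposition above can be sketched as a small function. The 10-cycle L2 hit penalty comes from the L2 Hits formula; the L2 miss and DTLB penalties are caller-supplied because the slide only says "penalty". The function name, all counts, and the DTLB penalty value in the example are hypothetical; the 165-cycle L2 miss penalty reuses the desktop figure from the Memory Access slide.

```python
L2_HIT_PENALTY_CYCLES = 10  # from the L2 Hits formula on this slide


def decompose_stalls(cycles, stall_cycles, br_miss_clear,
                     l1d_line_miss, l2_line_miss, dtlb_miss,
                     l2_miss_penalty, dtlb_miss_penalty):
    """Return each stall component as a fraction of total cycles."""
    parts = {
        "pipeline_flush": br_miss_clear / cycles,
        "l2_hits": (l1d_line_miss - l2_line_miss) * L2_HIT_PENALTY_CYCLES / cycles,
        "l2_misses": l2_line_miss * l2_miss_penalty / cycles,
        "dtlb": dtlb_miss * dtlb_miss_penalty / cycles,
    }
    # FE + Scoreboard = Stalls - all of the above.
    parts["fe_scoreboard"] = stall_cycles / cycles - sum(parts.values())
    return parts


# Hypothetical counts for one hot loop.
breakdown = decompose_stalls(
    cycles=1_000_000, stall_cycles=500_000, br_miss_clear=50_000,
    l1d_line_miss=12_000, l2_line_miss=2_000, dtlb_miss=1_000,
    l2_miss_penalty=165, dtlb_miss_penalty=10,
)
for name, fraction in breakdown.items():
    print(f"{name}: {fraction:.3f}")
```

Because the penalties are estimates, the remainder assigned to FE + Scoreboard can come out small or even negative when the named components are over-counted.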
Decomposing Unstalled Cycles
[Chart: "Decomposing Unstalled Cycles", a stacked bar showing Uops Retiring, OOO Bursts, Non_retired, and Stalls as fractions of total cycles (axis from 0.75 to 1.05). Illustrative example only, not real data.]
Non_Retired = (1 – (Uops_retired.any + Uops_retired.fused) / RS_Uops_Dispatched) * RS_Uops_Dispatched.Cycles_None / CPU_CLK_UNHALTED.CORE
OOO Bursts = Uops_Retired.Any - Stalls – Non_Retired
Pulling it All Together
Risks Over-counting / Minimizing FE + Scoreboard
But Offers a Guide to Execution Inefficiencies
[Chart: "Cycle Decomposition", a stacked bar showing Uops Retiring, OOO Bursts, Non_retired, FE + Scoreboard, Pipeline Flush, DTLB, L2 Hits, and L2 Misses as fractions of total cycles (axis from 0 to 1.2). Illustrative example only, not real data.]
The “Big 4” Events for Performance
CYCLES, STALLS, UNPREFETCHED LOADS and BANDWIDTH
CPU_CLK_UNHALTED.CORE
RS_UOPS_DISPATCHED.CYCLES_NONE
MEM_LOAD_RETIRED.L2_LINE_MISS
BUS_TRANS_ANY.SELF
Architectural Pitfalls: The Ants
Issue                                                       Performance Counter          Approx. Penalty (cycles)
store to unknown addr precedes load                         Load_Blocks.ADR              ~5
store forwarding 4 bytes from middle of 8                   Load_Blocks.Overlap_Store    ~6
store to known address precedes load offset by N*4096       Load_Blocks.Overlap_Store    ~6
load from 2 cachelines (not in L1D)                         Load_Blocks.UNTIL_RETIRE     ~22
load from 2 cachelines with preceding store (not in L1D)    Load_Blocks.UNTIL_RETIRE     ~20
Length Changing Prefix (16 bit imm)                         ILD_STALLS                   ILD_STALLS, or ~6 per
Contribute to “FE + Scoreboard”
And don’t forget Micro-Fusion, Macro-fusion, etc.
A Heuristic Break-down for Stall Analysis
[Diagram: a tree rooted at "Stalls?" with the branches Front End Stalls, Resource Stalls, Exe Unit Stalls, and Retirement Efficiency and others. Annotations on the branches: the “Big 4 (L2 cache)”, L1D cache; RS related and RAT related; Register related, Domain related; Instructions decoding, LCP…]
A Heuristic Break-down for Stall Analysis (cont.)
Stall Components        Counter Name                        Solutions
Front End
  L2 cache              MEM_LOAD_RETIRED.L2_LINE_MISS       Alignment
  DTLB                  MEM_LOAD_RETIRED.DTLB_MISS          SW prefetch
  L1 data cache         MEM_LOAD_RETIRED.L1D_LINE_MISS
  Instruction Queue     INST_QUEUE.FULL                     Decode pattern
  Branch prediction     RESOURCE_STALLS.BR_MISS_CLEAR       PGO, Removing uncertainty or branch
Execution Core
  Reservation station   RESOURCE_STALLS.RS_FULL
  ReOrder Buffer        RAT_STALLS.ROB_READ_PORT
                        RESOURCE_STALLS.ROB_FULL
  Dispatching           RS_UOPS_DISPATCHED
  Partial updating      RAT_STALLS.FLAGS                    Whole register update
                        RAT_STALLS.PARTIAL_CYCLES
  Domain swing          RESOURCE_STALLS.FPCW
                        FP_MMX_TRANS.TO_MMX
                        FP_MMX_TRANS.TO_FP
  Memory                BUS_TRANS_ANY
Lab Activity 2: Using SW tools to reduce the instruction count (path length)
In this lab, you will practice using the Intel compiler vectorization switch to reduce the instruction count.
Lab Activity 3: Addressing the performance bottleneck in the Front End
In this lab, you will use the “Big 4” events analysis to identify and address a performance issue in the Front End of the processor.
Lab Activity 4: Addressing the performance bottleneck in the Execution Core
In this lab, you will identify and address a performance issue in the execution core of the processor.
A Loop Methodology
• Identify hot functions and raise optimization
  • Fix alignments, split loops to enhance vectorization
• Identify BW limited functions
  • Merge BW loops with FP limited loops
• Identify L2 misses and add sw prefetch
• Optimize flow through OOO Engine
  • Use loop splitting to assist here
More Detailed Event Selection Hierarchy
FIRST PASS EVENTS Sample After Value
CPU_CLK_UNHALTED.CORE 2,000,000
RS_UOPS_DISPATCHED.CYCLES_NONE 2,000,000
UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED 2,000,000
RS_UOPS_DISPATCHED 2,000,000
MEM_LOAD_RETIRED.L2_LINE_MISS 10,000
INST_RETIRED.ANY_P 2,000,000
Loops
BUS_TRANS_ANY.SELF 100,000
BUS_TRANS_ANY.ALL_AGENTS 100,000
Branch Dominated
RESOURCE_STALLS.BR_MISS_CLEAR 2,000,000
SAV values selected so ratio of samples ~ absorbs penalty
More Detailed Event Selection Hierarchy (cont.)
SECOND LEVEL EVENTS Sample After Value
MEM_LOAD_RETIRED.DTLB_MISS 20,000
MEM_LOAD_RETIRED.L1D_LINE_MISS 200,000
BR_CND_EXEC BR_CND_EXEC_MISPRED 2,000,000
BR_CALL_EXEC BR_CALL_EXEC_MISPRED 200,000
RESOURCE_STALLS.RS_FULL (anti correlate) 2,000,000
ILD_STALLS 200,000
LOAD_BLOCK.OVERLAP_STORE 200,000
SAV values selected so ratio of samples ~ absorbs penalty
EX: L1 miss/L2_hit penalty is 10 cycles
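One way to read the note above: if an event's sample after value is the clocktick sample after value divided by the event's penalty, then the ratio of event samples to clocktick samples approximates the fraction of cycles spent on that penalty. The sketch below checks this for the 10-cycle L1 miss/L2 hit example; the function names and event counts are hypothetical.

```python
def matching_sav(clocktick_sav: int, penalty_cycles: int) -> int:
    """Sample after value that makes sample ratios absorb the penalty."""
    return clocktick_sav // penalty_cycles


def sample_ratio(event_count: int, event_sav: int,
                 cycle_count: int, cycle_sav: int) -> float:
    """Ratio of event samples to clocktick samples."""
    event_samples = event_count / event_sav
    cycle_samples = cycle_count / cycle_sav
    return event_samples / cycle_samples


clocktick_sav = 2_000_000
l1_miss_sav = matching_sav(clocktick_sav, penalty_cycles=10)  # 200,000, as in the table

# Hypothetical counts: 400,000 L1 misses over 20,000,000 cycles.
ratio = sample_ratio(400_000, l1_miss_sav, 20_000_000, clocktick_sav)
direct = 400_000 * 10 / 20_000_000  # event count * penalty / cycles
print(l1_miss_sav, ratio, direct)
```

With these values, the sample ratio and the directly computed penalty fraction agree, which is what lets sample counts be compared side by side in the analyzer.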
Summary
• Utilize Core™ micro-architecture for software performance
  • Front end
  • OOO execution core
• Use the VTune™ analyzer to identify micro-architectural bottlenecks in your software.
• Use a cycles accounting methodology to improve the performance.
Micro-Architecture Comparison
                               Intel NetBurst™++                    NGMA**
Pipeline Stages                31                                   14
Threads per core               2                                    1
L1 Cache Org.                  (12K uop Trace Cache/16K Data)       (32K I/32K Data)
L2 Cache Org.                  2 x 2MB                              1 x 4MB (shared)
Instr. Decoders                1                                    4
Integer Units                  2 (2x core freq)                     3 (1x core freq)
SIMD Units                     2 x 64-bits                          3 x 128-bits
SIMD Inst. Issued per Clock    1                                    3
FP Units                       3 (Add/Mul/Div)                      3 (Add/Mul/Div)
FP Inst. Issued per clock      1                                    Up to 2 (Add + Mul or Div)
Power                          135W                                 80W
++ Cedar Mill/Dempsey
** NGMA = Next Generation Micro-Architecture (Conroe/Woodcrest)
= per core
Execution Unit Comparisons
[Figure: execution units by port. Intel NetBurst® Micro-Architecture: FP Add/Mul/Div, Integer Shift/Rotate, SIMD, Integer Multiply, and two Integer Arithmetic units at 2x core frequency. NGMA: Ports 0, 1 and 5 hold three Integer Arithmetic units and three SIMD units together with FP Add, FP Div/Mul and Integer Shift/Rotate; Port 2 is Load and Port 4 is Store.]
DTLB Structure
DTLB component   entries  ways  sets  miss event                ~ miss penalty
L0 small page    16       4     4     Dtlb_Misses.L0_miss       2
L1 small page    256      4     64    Dtlb_Misses.L1_miss       typical ~ 10
L0 Large Page    16       4     4     Dtlb_Misses.L0_miss_LG    2
L1 Large Page    32       4     8     Dtlb_Misses.L1_miss_LG    typical ~ 11-12
HW Page Walks                         PMH.Walks                 ~PMH.Cycles
[Figure: DTLB Access Penalty. Cycles (0 to 25) versus number of pages accessed (0 to 1200), with three series: "L2 $ Hit, L1DTLB Miss", "L1 $ Hit, L1DTLB Miss" and "L1 $ Hit, L1DTLB Hit".]
Disclaimer: Data is from a pointer chasing microbenchmark and for illustrative purposes only
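The geometry in the DTLB table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the `Tlb` class is a hypothetical helper, and the 4 KiB small-page size is an assumption that the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tlb:
    """One row of the DTLB table: name, total entries, associativity."""
    name: str
    entries: int
    ways: int

    @property
    def sets(self) -> int:
        # A set-associative structure has entries / ways sets.
        return self.entries // self.ways

    def reach_bytes(self, page_bytes: int) -> int:
        # Memory that can be mapped without a miss at this level.
        return self.entries * page_bytes


SMALL_PAGE = 4 * 1024  # assumption: 4 KiB small pages

levels = [
    Tlb("L0 small page", 16, 4),
    Tlb("L1 small page", 256, 4),
    Tlb("L0 large page", 16, 4),
    Tlb("L1 large page", 32, 4),
]

for tlb in levels:
    print(f"{tlb.name}: {tlb.sets} sets")

# Reach of the L1 small-page DTLB under the 4 KiB assumption.
print(levels[1].reach_bytes(SMALL_PAGE) // 1024, "KiB")
```

The computed set counts (4, 64, 4, 8) match the table. Under the 4 KiB assumption the 256-entry L1 small-page DTLB covers 1024 KiB, so a working set touching more than 256 small pages would be expected to start missing at that level.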
PEBS Usage and Issues
• Using Precise Event Based Sampling captures architectural state at the time of the event occurrence
• Basic Block Execution = average of inst_retired over the BB
• However, inst_retired does not give a flat distribution within a basic block.
  • Therefore the average over the basic block should be used
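The averaging step can be sketched as below. The function name and the per-instruction sample counts are hypothetical; the sketch assumes each inst_retired sample stands for SAV retired instructions, so the per-instruction average times the SAV estimates how often the block ran.

```python
def basic_block_executions(samples_per_instruction, sav):
    """Estimate basic-block executions from inst_retired samples.

    samples_per_instruction: sample counts attributed to each instruction
    of one basic block (they are typically uneven, not flat).
    sav: the Sample After Value used for inst_retired.
    """
    if not samples_per_instruction:
        raise ValueError("basic block must contain at least one instruction")
    average = sum(samples_per_instruction) / len(samples_per_instruction)
    return average * sav


# Hypothetical five-instruction block with a skewed sample distribution.
samples = [0, 7, 1, 12, 0]
print(basic_block_executions(samples, 2_000_000))  # 8000000.0
```

Reading any single instruction here would give anything from 0 to 24 million executions; the average over the block is what yields a usable estimate.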
Manipulating the XML File
<EVENT>
<HELPID>CB08</HELPID>
<CODE>0xCB</CODE> <!-- event number -->
<UMASK>0x08</UMASK> <!-- event mask or user mask -->
<OTHER>0x53</OTHER> <!-- Cmask, Inv etc -->
<COMMON>0x601001</COMMON> <!-- bitmask for groups event is in…add 2 to put in “favorites” -->
<WEIGHT>0</WEIGHT>
<COUNTER>0</COUNTER> <!-- counters that can be used..precise events must use 0 -->
<NAME>MEM_LOAD_RETIRED.L2_LINE_MISS</NAME>
<DESCRIPTION>L2 cache line missed by retired loads (precise event).</DESCRIPTION>
<HELP_FILE>pmm.chm</HELP_FILE>
<OVERFLOW>10000</OVERFLOW> <!-- default SAV -->
<PRECISE_EVENT>yes</PRECISE_EVENT> <!-- identifier for precise events -->
</EVENT>
DL’s New Favorite
<EVENT>
<HELPID>A000</HELPID>
<CODE>0xA0</CODE>
<UMASK>0x00</UMASK>
<OTHER>0x1D3</OTHER> <!-- setting cmask = 1 and inv = 1 -->
<COMMON>0x503</COMMON>
<WEIGHT>0</WEIGHT>
<COUNTER>0</COUNTER> <!-- forcing counter 0 -->
<NAME>RS_UOPS_DISPATCHED_c1_inv</NAME> <!-- new name -->
<DESCRIPTION>Uops Dispatched from the RS</DESCRIPTION>
<HELP_FILE>pmm.chm</HELP_FILE>
<OVERFLOW>2000000</OVERFLOW>
</EVENT>
Cycles Where NO Uops are Dispatched From RS
Loop Distribution for Resource Management
for (i…; i++) {
    inst1
    inst2
    inst3
    ...
    instN (final store)
}

becomes

for (i…; i += blk) {
    for (j = i; j < i + blk; j++) {
        inst1; inst2; ...; instM
        store_intermediate[j-i]
    }
    for (j = i; j < i + blk; j++) {
        load_intermediate[j-i]
        instM+1; ...; instN (final store)
    }
}
Shorter Loops -> Greater Unrolling -> Greater ILP
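The transformation can be sketched on a concrete loop. The example below only demonstrates the structure and that the result is unchanged; Python will not show the ILP benefit, and the arithmetic standing in for inst1..instN, the function names and the block size are all hypothetical.

```python
def fused(x):
    """Original loop: the whole instruction chain runs in one long body."""
    out = []
    for value in x:
        t = value * 2.0 + 1.0        # stands in for inst1 .. instM
        out.append(t * t - value)    # stands in for instM+1 .. instN (final store)
    return out


def distributed(x, blk=4):
    """Loop distribution: two short loops per block, linked by a small buffer."""
    out = [0.0] * len(x)
    intermediate = [0.0] * blk       # block-sized buffer, meant to stay cache resident
    for i in range(0, len(x), blk):
        end = min(i + blk, len(x))   # handle a partial final block
        for j in range(i, end):      # first half of the chain
            intermediate[j - i] = x[j] * 2.0 + 1.0
        for j in range(i, end):      # second half of the chain
            t = intermediate[j - i]
            out[j] = t * t - x[j]
    return out


data = [float(k) for k in range(10)]
print(distributed(data) == fused(data))  # True
```

Note that the inner loops run from `i` to `i + blk`, so the buffer index `j - i` stays in the range 0 to `blk - 1`.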
Cycle Accounting on X86
• Non retired uop cycles ~ non retired uops / avg_uops_per_cycle
    ~ rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )
CPU_CLK_UNHALTED = Stalls + non_retired + effective
    = rs_uops_dispatched:c1:i1
    + rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )
    + Effective_cycles
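The decomposition can be evaluated directly from raw event counts. The sketch below follows the formula on this slide; the function and argument names are hypothetical, and the counts in the usage example are invented round numbers rather than measurements.

```python
def cycle_accounting(cpu_clk_unhalted, dispatched_c1, dispatched_c1_i1,
                     rs_uops_dispatched, uops_retired_any, uops_retired_fused):
    """Split unhalted cycles into stall, non-retired and effective cycles.

    dispatched_c1    : rs_uops_dispatched:c1    (cycles with at least one uop dispatched)
    dispatched_c1_i1 : rs_uops_dispatched:c1:i1 (cycles with no uop dispatched)
    """
    stalls = dispatched_c1_i1
    retired_fraction = (uops_retired_any + uops_retired_fused) / rs_uops_dispatched
    # Dispatch cycles spent on uops that never retire.
    non_retired = dispatched_c1 * (1 - retired_fraction)
    # Whatever remains of the total is effective work.
    effective = cpu_clk_unhalted - stalls - non_retired
    return {"stalls": stalls, "non_retired": non_retired, "effective": effective}


# Hypothetical counts: 1000 cycles, 600 dispatching, 400 stalled,
# 1200 uops dispatched of which 900 retire.
result = cycle_accounting(1000, 600, 400, 1200, 800, 100)
print(result)  # {'stalls': 400, 'non_retired': 150.0, 'effective': 450.0}
```

By construction the three components sum back to CPU_CLK_UNHALTED, which makes the breakdown easy to sanity-check against the measured total.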
Methodology Overview
The traditional view of performance tuning on X86 processors has focused on instruction retirement
The OOO engine has always been viewed as an impenetrable and incomprehensible beast
This is perhaps not the best perspective
Four Component HW Prefetcher
• L1 Cache Prefetch (first in Intel® Core Duo Processor)
  • DCU or Streaming prefetcher
    • DCU = Data Cache Unit
  • IP prefetch
    • Repeated stride load at frequently executed IP
• L2 Prefetch (similar to Pentium™ 4 processor)
VTune™ Analyzer Edit Event
See Backup Slides for Creating New Pre-Edited Events in XML File
Some Features of the PMU
CMASK     Value to be compared against
INV       Invert from GE to LT
EN        Enable Counters
INT       APIC Interrupt Enable
PC        Pin Control
E         Count on changing edge
OS        Count Ring 0 execution
USR       Count Ring 3 execution
umask     Event mask
Event #   Event number
Setting CMASK = 1 and INV = 1 for RS_UOPS_DISPATCHED counts cycles where NO UOPS WERE DISPATCHED == Stalls (RS_UOPS_DISPATCHED.CYCLES_NONE)
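As an illustration of how these fields combine, the sketch below packs them into a 32-bit event select value using the architectural IA32_PERFEVTSELx bit positions. The helper name is hypothetical, and reading the XML `<OTHER>` field as the upper 16 bits of this register is an inference from the values in the backup slides, not something the slides state.

```python
# Bit positions of the architectural event select register fields.
USR = 1 << 16   # count ring 3 execution
OS = 1 << 17    # count ring 0 execution
E = 1 << 18     # edge detect
PC = 1 << 19    # pin control
INT = 1 << 20   # APIC interrupt enable
EN = 1 << 22    # enable counter
INV = 1 << 23   # invert the CMASK comparison (GE becomes LT)
CMASK_SHIFT = 24


def perfevtsel(event, umask, flags, cmask=0):
    """Pack event number, umask, flag bits and CMASK into one 32-bit value."""
    return ((event & 0xFF)
            | ((umask & 0xFF) << 8)
            | flags
            | ((cmask & 0xFF) << CMASK_SHIFT))


# RS_UOPS_DISPATCHED (event 0xA0, umask 0x00) with CMASK = 1 and INV = 1:
# counts cycles in which fewer than one uop was dispatched.
stall_cycles = perfevtsel(0xA0, 0x00, USR | OS | INT | EN | INV, cmask=1)
print(hex(stall_cycles))        # 0x1d300a0
print(hex(stall_cycles >> 16))  # 0x1d3
```

The upper half, 0x1D3, agrees with the `<OTHER>` value used for the RS_UOPS_DISPATCHED_c1_inv event definition, and dropping INV and CMASK leaves 0x53, the `<OTHER>` value of the unmodified event.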
A Methodology?
Total Cycles ~ CPU_CLK_UNHALTED
RS_UOPS_DISPATCHED:c1 (execution cycles)
RS_UOPS_DISPATCHED:c1:i1 (stall cycles)
CPU_CLK_UNHALTED can be decomposed into execution and stall cycles in the OOO engine
Requires >99% CPU Utilization OR User PL only/sampling
EVENTS COUNT EVEN DURING HALTED CYCLES
VTune™ Analyzer Event Basics
Thread Specific and Independent Event Categories
Thread Specific (TS) – Sample count is per logical processor.
Thread Independent (TI) – Sample count is per physical processor.
• All events are attributed to logical processor 0 – WATCH OUT: The Addresses Might Be Incorrect!
Thread specific ESCR limited (TS-E) – Sample count is per logical processor but only data for one logical processor can be captured in a single run.
If not specified, the event is TS.
The Distribution of uops/cycle
emon -q -t0 -C \(RS_UOPS_DISPATCHED:v\) -f $1_uop_count.txt $1
Up to N uops/cycle
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c1:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c2:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c3:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c4:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c5:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c6:i1:v\) -F $1_uop_count.txt $1
emon -q -t0 -C \(RS_UOPS_DISPATCHED:c7:i1:v\) -F $1_uop_count.txt $1
Subtract the N-1 value
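The subtraction step can be sketched as follows. With INV set, a CMASK of N counts cycles in which fewer than N uops were dispatched, so consecutive differences give the cycles with exactly N-1 uops. The function name and the counts below are hypothetical, and parsing the emon output file is left out.

```python
def uops_per_cycle_histogram(cumulative_counts):
    """Turn :cN:i1 counts into a per-cycle dispatch histogram.

    cumulative_counts[n - 1] is the count for RS_UOPS_DISPATCHED:cN:i1,
    i.e. the number of cycles with fewer than N uops dispatched.
    Returns a list where index k holds the cycles with exactly k uops.
    """
    histogram = []
    previous = 0
    for count in cumulative_counts:
        histogram.append(count - previous)  # subtract the N-1 value
        previous = count
    return histogram


# Hypothetical counts for c1:i1 through c7:i1.
counts = [400, 550, 700, 850, 950, 990, 1000]
print(uops_per_cycle_histogram(counts))
# [400, 150, 150, 150, 100, 40, 10]
```

The first entry is the stall-cycle count (zero uops dispatched), and the entries sum to the last cumulative value.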
[Figure: Distribution of the Instruction Level Parallelism (example: a[i] = exp(x[i]); in a simple loop). Horizontal axis: uops dispatched per cycle (0 to 8); vertical axis: cycle count (0 to 18,000,000,000).]