Rapid Identification of Architectural Bottlenecks via Precise Event Counting
John Demme, Simha Sethumadhavan
Columbia University
{jdd,simha}@cs.columbia.edu
2002
CASTL: Computer Architecture and Security Technologies Lab 2
[Pie chart of language popularity (Java, C, C++, Visual Basic, Perl, PHP, JavaScript, Delphi, Python, Lisp, Scheme, C#, Objective-C, Other), alongside platforms]
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index
2011
[Pie chart of language popularity in 2011 (Java, C, C++, Visual Basic, Perl, PHP, JavaScript, Delphi, Python, Lisp, Ada, Scheme, C#, Objective-C, Ruby, Lua, Go, Other)]
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index
Platforms
Multicore
Moore’s Law
HOW CAN WE POSSIBLY KEEP UP?
Architectural Lifecycle
Performance Data Collection → Human Analysis → Architectural Improvement → (repeat)
Performance Data Collection
• Analytical Models
  – Fast, but questionable accuracy
• Simulation
  – Often the gold standard
  – Very detailed information
  – Very slow
• Production Hardware (performance counters)
  – Very fast
  – Not very detailed
Performance Data Collection (revisited)
• Production Hardware (Performance Counters)
  – Very fast
  – Relatively detailed (revising "not very detailed" above)
ACCURACY, PRECISION & PERTURBATION
A comparison of performance monitoring techniques and the uncertainty principle
Accuracy, Precision & Perturbation
• In normal execution, the program interacts with the microarchitecture as expected
Normal Program Execution
[Timeline: the program runs alongside its corresponding machine state (cache, branch predictor, etc.)]
Precise Instrumentation
• When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate
[Timeline: monitored execution with instrumentation at the start of mutex_lock, mutex_unlock, and barrier_wait; the measured machine state diverges from the "correct" machine state (cache, branch predictor, etc.)]
Performance Counter SW Landscape

          | Precise (Heavyweight)
Behavior  | Reads counters whenever the program or instrumentation requests a read
Examples  | PAPI, perf_event
Overhead  | Proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)
Sampling vs. Instrumentation
[Timeline: sampled execution is interrupted every n cycles; traditionally instrumented execution reads at the start of mutex_lock, mutex_unlock, and barrier_wait]
• Traditional instrumentation is like polling
• Sampling uses interrupts
Performance Counter SW Landscape

          | Sampling (Heavyweight)                      | Precise (Heavyweight)
Behavior  | Interrupts every n cycles and extrapolates  | Reads counters whenever the program or instrumentation requests a read
Examples  | vTune, OProfile                             | PAPI, perf_event
Overhead  | Inversely proportional to n (up to 20%, usually much less) | Proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)
The Problem with Sampling
[Timeline: a sample interrupt fires at an arbitrary point; is this a critical section?]
Corrected with Precision
[Timeline: precise counter reads at the region's entry and exit pinpoint the critical section]
But, Precision Adds Overhead
[Timeline: monitored execution; each counter read perturbs the machine state, so the measured state diverges from the "correct" state (cache, branch predictor, etc.)]
Instrumentation Adds Perturbation
• If instrumentation sections are short, perturbation is reduced and measurements become more accurate
[Timeline: with short instrumentation sections, the measured machine state tracks the "correct" state closely]
Performance Counter SW Landscape

          | Sampling                   | Precise                    | Precise
Weight    | Heavyweight                | Heavyweight                | Lightweight
Behavior  | Interrupts every n cycles and extrapolates | Reads counters whenever the program or instrumentation requests a read | Reads counters whenever the program or instrumentation requests a read
Examples  | vTune, OProfile            | PAPI, perf_event           | LiMiT
Overhead  | Inversely proportional to n (up to 20%, usually much less) | Proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns) | Proportional to # of reads (11 ns)
Related Work
• No recent papers on better precise counting
  – Original PAPI paper: Browne et al. 2000
  – Some software, but none offering LiMiT's features
• Characterizing performance counters
  – Weaver & Dongarra 2010
• Sampling
  – Counter multiplexing techniques: Mytkowicz et al. 2007; Azimi et al. 2005
  – Trace alignment: Mytkowicz et al. 2006
REDUCING COUNTER READ OVERHEADS
Implementing lightweight, precise monitoring
Why Precision is Slow
Avoid system calls to avoid overhead.

Perfmon2 & perf_event:
  1. Program requests counter read (system call)
  2. Kernel reads counter and returns result (system return)
  3. Program uses value

LiMiT:
  1. Program reads counter directly
  2. Program uses value

Why is this so hard?
A Self-Monitoring Process
Run, process, run
[The running process accumulates events in the hardware counters: L1 misses 5, branches 24, cycles 39]

Overflow
[Hardware counters have limited width; here the cycle counter reaches its maximum value (100) and overflows]

Overflow Space
[On overflow, the kernel adds the counter's full range (100) to a per-process "overflow space" kept in user-readable memory and zeroes the hardware counter]

Modified Read
[A userspace read now returns hardware counter plus overflow space, e.g., 20 + 100 = 120]

Overflow During Read
[A read takes more than one instruction, so an overflow interrupt can fire in the middle of one: the counter reads 99, then overflows]

Atomicity Violation!
[If the process read the counter (99) just before the overflow and the overflow space (100) just after, it computes 100 + 99 = 199, double-counting the overflowed events]

OS Detection & Correction
[In the overflow handler, the kernel notices that the interrupted process was mid-read ("Looks like he was reading that...") and patches the process's state]

Atomicity Violation Corrected
[The corrected read returns 100 + 0 = 100, the true count]
So what does all this effort buy us?
Time to collect 3×10⁷ readings:

Time   | PAPI    | perf_event | LiMiT   | Speedup
User   | 1.26 s  | 0.53 s     | 0.034 s | 37x / 15.6x
System | 30.10 s | 7.30 s     | 0 s     | ∞
Wall   | 31.44 s | 7.87 s     | 0.34 s  | 92x / 23.1x
Average LiMiT Readout
Number of instructions | 5
Number of cycles       | 37.14
Time                   | 11.3 ns
LiMiT Enables Detailed Study
• Short counter reads decrease perturbation
• Little perturbation allows detailed study of
  – Short synchronization regions
  – Short function calls
• Three case studies
  – Synchronization in production web applications (not presented here; see paper)
  – Synchronization changes in MySQL over time
  – User/kernel code behavior in runtime libraries
CASE STUDY: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL
Has MySQL gotten better since the advent of multi-cores?
Evolution of Locking in MySQL
• Questions to answer
  – Has MySQL gotten better at locking?
  – What techniques have been used?
• Methodology
  – Intercept pthread locking calls
  – Count overheads and critical sections
MySQL Synchronization Times
[Stacked bar chart: percentage of execution spent Free / Locking / Lock Held / Unlocking (0-100%) for MySQL 4.1 (2004), 5.0 (2005), 5.1 (2008), and 5.5 (Beta, 2009)]
MySQL Critical Sections
[Bar chart: overall percentage of execution with a lock held (0-45%) and average lock hold time in cycles (0-1400) for MySQL 4.1 (2004), 5.0 (2005), 5.1 (2008), and 5.5 (Beta, 2009)]
Number of Locks in MySQL
[Bar chart: dynamic locks (0-6×10⁸) and static locks (0-4×10⁵) for MySQL 4.1 (2004), 5.0 (2005), 5.1 (2008), and 5.5 (Beta, 2009)]
Observations & Implications
• Coarser granularity, better performance
  – Total critical section time has decreased
  – Average CS times have increased
  – Number of locks has decreased
• Performance counters are useful for software engineering studies
CASE STUDY: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY
Does code in the kernel and runtime library behave?
Full System Analysis w/o Simulation
• Questions to answer
  – How much time do system applications spend in runtime libraries?
  – How well do they perform in them? Why?
• Methodology
  – Intercept common libc, libm and libpthread calls
  – Count user-/kernel-space events during the calls
  – Break down by purpose (I/O, Memory, Pthread)
• Applications
  – MySQL, Apache
• Intel Nehalem microarchitecture
Execution Cycles in Library Calls
[Stacked bar chart: percentage of total cycles in Pthreads / Memory / I/O library calls (0-50%) for MySQL (User), MySQL (Kernel), Apache (User), Apache (Kernel)]
MySQL Clocks per Instruction
[Bar chart: clocks per instruction (0-2.0) for User, Kernel, Libc, and Program]
L3 Cache MPKI
[Bar chart: L3 misses per kilo-instruction for I/O, Memory, and Pthreads calls; 0-2 scale for MySQL (User), MySQL (Kernel), Apache (User); separate 0-35 scale for Apache (Kernel)]
I-Cache Stall Cycles
[Bar chart: I-cache stall cycles as a percentage of total cycles (0-3%) for I/O, Memory, and Pthreads calls across MySQL (User), MySQL (Kernel), Apache (User), Apache (Kernel); two off-scale values annotated at 22.4% and 12.0%]
Observations & Implications
• Apache is fundamentally I/O bound
  – Optimization of the I/O subsystem is necessary
• Kernel code suffers from I-cache stalls
  – Speculation: bad interrupt instruction prefetching
• LiMiT yields detailed performance data
  – Not as accurate or detailed as simulation
  – But gathered in hours rather than weeks
CONCLUSIONS
Research methodology implications, closing thoughts
Conclusions
• Implications from case studies
  – MySQL's multicore experience helped scalability
  – Performance counting is useful for non-architecture research
  – Libraries and kernels perform very differently
  – I/O subsystems can be slow
• Research methodology
  – LiMiT can provide detailed results quickly
  – Simulators are more detailed but slow
  – Opportunity to build microbenchmarks:
    • Identify bottlenecks with counters
    • Verify representativeness with counters
    • Then simulate
QUESTIONS?
BACKUP SLIDES
Man down! Need backup!
Performance Evaluation Methods

                    | Accuracy | Precision | Speed | Cost
Simulators          | ↑        | ↑         | ↓     | ↑/↓
Analytical Models   | ?        | ?         | ↑     | ↓
Prototype Hardware  | ↑        | ↑         | ↑     | ↑
Production Hardware | ↑/↓      | ↑/↓       | ↑     | ↓
(Accuracy and precision are traded off.)

• Production hardware provides performance counters
• However, existing interfaces make the accuracy/precision tradeoff difficult
Sampling vs. LiMiT
[Timeline: sampled execution is interrupted every n cycles; LiMiT-instrumented execution reads at the start of mutex_lock, mutex_unlock, and barrier_wait]
Another process runs
[Analogy: one process's tallies (miles 5, pushups 24, situps 39) keep growing while another process runs, so the other's work is counted too: "30 miles! I did pretty well today." "No you didn't."]

Fix: Virtualization
[On each context switch the kernel saves and restores per-process counter values, so every process sees only its own counts]

Avoiding Communication
[The per-process totals live in user-readable memory, so counter reads need no user/kernel communication]
LiMiT Operation
RDTSC
MySQL Instrumentation Overhead
[Bar chart: MySQL execution cycles, user time (0-2.5×10¹²), with no instrumentation, LiMiT, perf_event, and PAPI]
CASE STUDY A: LOCKING IN WEB WORKLOADS
How does web-related software use locks?
Locking on the Web
• Questions to answer
  – Is locking a significant concern?
  – How can architects help?
  – Are traditional benchmarks similar?
• Methodology
  – Intercept pthread mutex calls, time with LiMiT
• Applications
  – Firefox, Apache, MySQL, PARSEC
Execution Time by Region
CASTL: Computer Architecture and Security Technologies Lab 63
Firefox LiMiT
Apache LiMiT
Parsec LiMiT
MySQL LiMiT
Apache PAPI
Parsec PAPI
MySQL PAPI
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
FreeLockLock HeldUnlock
Perc
enta
ge o
f Tot
al U
ser C
ycle
s
Locking Statistics

                             | Firefox | Apache | PARSEC | MySQL
Avg. Lock Held Time (cycles) | 789     | 149    | 118    | 1076
Dynamic Locks per 10k Cycles | 3.24    | 1.12   | 0.545  | 3.18
Static Locks                 | 57      | 1      | 17     | 13853
Observations & Implications
• Applications like Firefox and MySQL use locks differently from Apache and PARSEC
  – Many notions of synchronization based on scientific computing probably don't apply
• Locking overheads up to 8-13%
  – More efficient mechanisms may be helpful
  – But 13% is an upper bound on speedup
• MySQL has some very long critical sections
  – Prime targets for micro-architectural optimization
  – If they run faster, MySQL scales better
Hardware Enhancements
• 64-bit reads and writes
  – Overflows are the primary source of complexity
  – 64-bit counters with full read/write eliminate it
• Destructive reads
  – Difference = 2 reads, store, load & subtract
  – Destructive-read difference = 2 reads
• Combined reads
  – An x86 counter read requires 2 instructions
  – Combining them should reduce overhead
• AMD's Lightweight Profiling proposal
  – Really good, depending on microarchitecture