@kuksenk0#JavaPerfAndHWC
Java Performance: Speedup your applications with hardware counters
Sergey Kuksenko@kuksenk0
@kuksenk0 #JavaPerfAndHWC
The following is intended to outline our general productdirection. It is intended for information purposes only, and maynot be incorporated into any contract. It is not a commitmentto deliver any material, code, or functionality, and should not berelied upon in making purchasing decisions. The development,release, and timing of any features or functionality described forOracleβs products remains at the sole discretion of Oracle.
2/64
@kuksenk0 #JavaPerfAndHWC
Intriguing Introductory Example
Gotcha, here is my hot methodor
On Algorithmic Oβptimizations
βjust-O
3/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Codepublic Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
π(π3)β
βbig-O
4/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Codepublic Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
π(π3)β
βbig-O
4/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
N multiply strassen
32 33 ππ 64 252 ππ 128 3019 ππ 2119 ππ 256 56 ππ 15 ππ 512 488 ππ 111 ππ 1024 3270 ππ 785 ππ 2048 32 π 6 π 4096 309 π 42 π 8192 2611 π 293 π
time/op
6/64
@kuksenk0 #JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256β
per multiplication per iterationCPI 1.06cycles 146Γ 106 8.7instructions 137Γ 106 8.2L1-loads 68Γ 106 4L1-load-misses 33Γ 106 2L1-stores 0.56Γ 106
L1-store-misses 38357LLC-loads 26Γ 106 1.6LLC-stores 11595
ββΌ 256Kb per matrix
8/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
L1-load-misses
9/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
9/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiplyIKJ(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
int aik = A[i][k];
for (int j = 0; j < size; j++) {
R[i][j] += aik * B[k][j];
}
}
}
return result;
}
10/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ
strassenIKJ
32 33 ππ 13 ππ 64 252 ππ 80 ππ 128 3019 ππ 2119 ππ 464 ππ 256 56 ππ 15 ππ 4 ππ 512 488 ππ 111 ππ 29 ππ
25 ππ
1024 3270 ππ 785 ππ 369 ππ
192 ππ
2048 32 π 6 π 3.4 π
1.4 π
4096 309 π 42 π 25 π
10 π
8192 2611 π 293 π 210 π
73 π
time/op
11/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Very Hot Code
N multIJK strassenIJK multIKJ strassenIKJ
32 33 ππ 13 ππ 64 252 ππ 80 ππ 128 3019 ππ 2119 ππ 464 ππ 256 56 ππ 15 ππ 4 ππ 512 488 ππ 111 ππ 29 ππ 25 ππ 1024 3270 ππ 785 ππ 369 ππ 192 ππ 2048 32 π 6 π 3.4 π 1.4 π 4096 309 π 42 π 25 π 10 π 8192 2611 π 293 π 210 π 73 π
time/op
11/64
@kuksenk0 #JavaPerfAndHWC
Example 1: JMH, -prof perfnorm, N=256
IJK IKJmultiply iteration multiply iteration
CPI 1.06 0.51cycles 146Γ 106 8.7 9.7Γ 106 0.6instructions 137Γ 106 8.2 19Γ 106 1.1L1-loads 68Γ 106 4 5.4Γ 106 0.3L1-load-misses 33Γ 106 2 1.1Γ 106 0.1L1-stores 0.56Γ 106 2.7Γ 106 0.2L1-store-misses 38357 8959LLC-loads 26Γ 106 1.6 0.3Γ 106
LLC-stores 11595 3532
12/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore
9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc
0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx
0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge
...
Vectorization (SSE/AVX)!
13/64
@kuksenk0 #JavaPerfAndHWC
Example 1: Free beer benefits!
cycles insts
...
6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore
9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc
0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx
0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge
...
Vectorization (SSE/AVX)!
13/64
@kuksenk0 #JavaPerfAndHWC
Chapter 1
Performance Optimization Methodology:A Very Short Introduction.
Three magic questions and the direction of the
journey
14/64
@kuksenk0 #JavaPerfAndHWC
Three magic questions
What? β Where? β How?
β What prevents my application to work faster?(monitoring)
β Where does it hide?(profiling)
β How to stop it messing with performance?(tuning/optimizing)
15/64
@kuksenk0 #JavaPerfAndHWC
Three magic questions
What?
β Where? β How?
β What prevents my application to work faster?(monitoring)
β Where does it hide?(profiling)
β How to stop it messing with performance?(tuning/optimizing)
15/64
@kuksenk0 #JavaPerfAndHWC
Three magic questions
What? β Where?
β How?
β What prevents my application to work faster?(monitoring)
β Where does it hide?(profiling)
β How to stop it messing with performance?(tuning/optimizing)
15/64
@kuksenk0 #JavaPerfAndHWC
Three magic questions
What? β Where? β How?
β What prevents my application to work faster?(monitoring)
β Where does it hide?(profiling)
β How to stop it messing with performance?(tuning/optimizing)
15/64
@kuksenk0 #JavaPerfAndHWC
Top-Down Approach
β System Levelβ Network, Disk, OS, CPU/Memory
β JVM Levelβ GC/Heap, JIT, Classloading
β Application Levelβ Algorithms, Synchronization, Threading, API
β Microarchitecture Levelβ Caches, Data/Code alignment, CPU Pipeline Stalls
16/64
@kuksenk0 #JavaPerfAndHWC
Methodology in Essence
http://j.mp/PerfMindMap
17/64
@kuksenk0 #JavaPerfAndHWC
Methodology in Essence
http://j.mp/PerfMindMap
17/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Letβs go (Top-Down)
Do system monitoring (e.g. mpstat)
β Lots of %sys β . . .β
β Lots of %irq, %soft β . . .β
β Lots of %iowait β . . .β
β Lots of %idle β . . .β
β Lots of %user
18/64
@kuksenk0 #JavaPerfAndHWC
Chapter 2
High CPU Load
and
the main question:
Β«who/what is to blame?Β»
19/64
@kuksenk0 #JavaPerfAndHWC
CPU Utilization
β What does βΌ100% CPU Utilization mean?
β OS has enough tasks to schedule
Can profiling help?
Profilerβ shows Β«WHEREΒ» application time is spent,but thereβs no answer to the question Β«WHYΒ».
βtraditional profiler
20/64
@kuksenk0 #JavaPerfAndHWC
CPU Utilization
β What does βΌ100% CPU Utilization mean?β OS has enough tasks to schedule
Can profiling help?
Profilerβ shows Β«WHEREΒ» application time is spent,but thereβs no answer to the question Β«WHYΒ».
βtraditional profiler
20/64
@kuksenk0 #JavaPerfAndHWC
CPU Utilization
β What does βΌ100% CPU Utilization mean?β OS has enough tasks to schedule
Can profiling help?
Profilerβ shows Β«WHEREΒ» application time is spent,but thereβs no answer to the question Β«WHYΒ».
βtraditional profiler
20/64
@kuksenk0 #JavaPerfAndHWC
CPU Utilization
β What does βΌ100% CPU Utilization mean?β OS has enough tasks to schedule
Can profiling help?
Profilerβ shows Β«WHEREΒ» application time is spent,but thereβs no answer to the question Β«WHYΒ».
βtraditional profiler
20/64
@kuksenk0 #JavaPerfAndHWC
Who/What is to blame?
Complex CPU microarchitecture:
β Inefficient algorithm β 100% CPU
β Pipeline stall due to memory load/stores β 100% CPU
β Pipeline flush due to mispredicted branch β 100% CPU
β Expensive instructions β 100% CPU
β Insufficient ILPβ β 100% CPU
β etc. β 100% CPU
βInstruction Level Parallelism
21/64
@kuksenk0 #JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,built into CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events(performance monitoring events)
23/64
@kuksenk0 #JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,built into CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events(performance monitoring events)
23/64
@kuksenk0 #JavaPerfAndHWC
Events: Issues
β Hundreds of events
β Microarchitectural experience is required
β Platform dependentβ vary from CPU vendor to vendor
β may vary when CPU manufacturer introduces newmicroarchitecture
β How to work with HWC?
25/64
@kuksenk0 #JavaPerfAndHWC
HWC
HWC modes:
β Counting modeβ if (event_happened) ++counter;
β general monitoring (answering Β«WHAT?Β» )
β Sampling modeβ if (event_happened)
if (++counter < threshold) INTERRUPT;
β profiling (answering Β«WHERE?Β»)
26/64
@kuksenk0 #JavaPerfAndHWC
HWC: Issues
So many events, so few counters
e.g. βNehalemβ:
β over 900 events
β 7 HWC (3 fixed + 4 programmable)
β Β«multiple-runningΒ» (different events)β Repeatability
β Β«multiplexingΒ» (only a few tools are able to do that)β Steady state
27/64
@kuksenk0 #JavaPerfAndHWC
HWC: Issues (cont.)
Sampling mode:
β Β«instruction skidΒ»(hard to correlate event and instruction)
β Uncore events(hard to bind event and execution thread;e.g. shared L3 cache)
28/64
@kuksenk0 #JavaPerfAndHWC
HWC: typical usages
β hardware validation
β performance analysis
β run-time tuning (e.g. JRockit, etc.)
β security attacks and defenses
β test code coverage
β etc.
29/64
@kuksenk0 #JavaPerfAndHWC
HWC: typical usages
β hardware validation
β performance analysis
β run-time tuning (e.g. JRockit, etc.)
β security attacks and defenses
β test code coverage
β etc.
29/64
@kuksenk0 #JavaPerfAndHWC
HWC: tools
β Oracle Solaris Studio Performance Analyzer(http://www.oracle.com/technetwork/server-storage/solarisstudio)
β perf/perf_events (http://perf.wiki.kernel.org)
β JMH(http://openjdk.java.net/projects/code-tools/jmh/)
-prof perf
-prof perfnorm = perf, normalized per operation
-prof perfasm = perf + -XX:+PrintAssembly
30/64
@kuksenk0 #JavaPerfAndHWC
HWC: tools (cont.)
β AMD CodeXL
β Intel Vtune Amplifier XE
β etc. . .
31/64
@kuksenk0 #JavaPerfAndHWC
perf events (e.g.)
β cycles
β instrustions
β cache-references
β cache-misses
β branches
β branch-misses
β bus-cycles
β ref-cycles
β dTLB-loads
β dTLB-load-misses
β L1-dcache-loads
β L1-dcache-load-misses
β L1-dcache-stores
β L1-dcache-store-misses
β LLC-loads
β etc...
32/64
@kuksenk0 #JavaPerfAndHWC
Oracle Studio events (e.g.)
β cycles
β insts
β branch-instruction-retired
β branch-misses-retired
β dtlbm
β l1h
β l1m
β l2h
β l2m
β l3h
β l3m
β etc...
33/64
@kuksenk0 #JavaPerfAndHWC
Chapter 4
Iβve got HWC data. Whatβs next?or
Introduction to microarchitecture
performance analysis
34/64
@kuksenk0 #JavaPerfAndHWC
Execution time
π‘πππ = ππ¦ππππ πππππ’ππππ¦
Optimization is. . .
reducing the (spent) cycle count!β
βeverything else is overclocking
35/64
@kuksenk0 #JavaPerfAndHWC
Microarchitecture Equation
ππ¦ππππ = πππ‘βπΏππππ‘β * πΆππΌ = πππ‘βπΏππππ‘β * 1πΌππΆ
β PathLength - number of instructions
β CPI - cycles per instruction
β IPC - instructions per cycle
36/64
@kuksenk0 #JavaPerfAndHWC
πππ‘βπΏππππ‘β * πΆππΌ
β PathLength βΌ algorithm efficiency (the smaller the better)
β CPI βΌ CPU efficiency (the smaller the better)β πΆππΌ = 4 β bad!
β πΆππΌ = 1β Nehalem β just good enough!
β SandyBridge and later β not so good!
β πΆππΌ = 0.4 β good!
β πΆππΌ = 0.2 β ideal!
37/64
@kuksenk0 #JavaPerfAndHWC
What to do?
β low CPIβ reduce PathLength β Β«tune algorithmΒ»
β high CPI β stallsβ memory stalls β Β«tune data structuresΒ»
β branch stalls β Β«tune control logicΒ»
β instruction dependency β Β«break dependency chainsΒ»
β long latency ops β Β«use more simple operationsΒ»
β etc. . .
38/64
@kuksenk0 #JavaPerfAndHWC
High CPI: Memory bound
β dTLB misses
β L1,L2,L3,...,LN misses
β NUMA: non-local memory access
β memory bandwidth
β false/true sharing
β cache line split (not in Java world, except. . . )
β store forwarding (unlikely, hard to fix on Java level)
β 4k aliasing
39/64
@kuksenk0 #JavaPerfAndHWC
High CPI: Core bound
β long latency operations: DIV, SQRT
β FP assist: floating points denormal, NaN, inf
β bad speculation (caused by mispredicted branch)
β port saturation (forget it)
40/64
@kuksenk0 #JavaPerfAndHWC
High CPI: Front-End bound
β iTLB miss
β iCache miss
β branch mispredict
β LSD (loop stream decoder)
solvable by HotSpot tweaking
41/64
@kuksenk0 #JavaPerfAndHWC
Example 2: perf stat -d
880851.993237 task -clock (msec)
39,318 context -switches
437 cpu -migrations
7,931 page -faults
2 ,277 ,063 ,113 ,376 cycles
1 ,226 ,299 ,634 ,877 instructions # 0.54 insns per cycle
229 ,500 ,265 ,931 branches # 260.544 M/sec
4 ,620 ,666 ,169 branch -misses # 2.01% of all branches
338 ,169 ,489 ,902 L1-dcache -loads # 383.912 M/sec
37 ,937 ,596 ,505 L1-dcache -load -misses # 11.22% of all L1-dcache hits
25 ,232 ,434 ,666 LLC -loads # 28.645 M/sec
7 ,307 ,884 ,874 L1-icache -load -misses # 0.00% of all L1 -icache hits
337 ,730 ,278 ,697 dTLB -loads # 382.846 M/sec
6 ,094 ,356 ,801 dTLB -load -misses # 1.81% of all dTLB cache hits
12 ,210 ,841 ,909 iTLB -loads # 13.863 M/sec
431 ,803 ,270 iTLB -load -misses # 3.54% of all iTLB cache hits
301.557213044 seconds time elapsed
44/64
@kuksenk0 #JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6
Letβs count:
4
2
Γ (338Γ 109)+12Γ ((38β 25)Γ 109)+36Γ (25Γ 109) β2408Γ 109 cycles ???
Potential Issues?
Supersc
alarin A
ction!
Issue!
45/64
@kuksenk0 #JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6
Letβs count:
4
2
Γ (338Γ 109)+12Γ ((38β 25)Γ 109)+36Γ (25Γ 109) β2408Γ 109 cycles ???
Potential Issues?
Supersc
alarin A
ction!
Issue!
45/64
@kuksenk0 #JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6
Letβs count:
4
2
Γ (338Γ 109)+12Γ ((38β 25)Γ 109)+36Γ (25Γ 109) β2408Γ 109 cycles ???
Potential Issues?
Supersc
alarin A
ction!
Issue!
45/64
@kuksenk0 #JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6
Letβs count:
4
2
Γ (338Γ 109)+12Γ ((38β 25)Γ 109)+36Γ (25Γ 109) β2408Γ 109 cycles ???
Potential Issues?
Supersc
alarin A
ction!
Issue!
45/64
@kuksenk0 #JavaPerfAndHWC
Example 2: Processed perf stat -d
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6
Letβs count:
4
2
Γ (338Γ 109)+12Γ ((38β 25)Γ 109)+36Γ (25Γ 109) β2408Γ 109 cycles ???
Potential Issues?
Supersc
alarin A
ction!
Issue!
45/64
@kuksenk0 #JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
β a memory cache that stores recent translations of virtualmemory addresses to physical addresses
β each memory access β access to TLB
β TLB miss may take hundreds cycles
How to check?
β dtlb_load_misses_miss_causes_a_walk
β dtlb_load_misses_walk_duration
46/64
@kuksenk0 #JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
β a memory cache that stores recent translations of virtualmemory addresses to physical addresses
β each memory access β access to TLB
β TLB miss may take hundreds cycles
How to check?
β dtlb_load_misses_miss_causes_a_walk
β dtlb_load_misses_walk_duration
46/64
@kuksenk0 #JavaPerfAndHWC
Example 2: dTLB misses
IPC 0.54cycles Γ109 2277instructions Γ109 1226L1-dcache-loads Γ109 338L1-dcache-load-misses Γ109 38LLC-loads Γ109 25dTLB-loads Γ109 338dTLB-load-misses Γ109 6dTLB-walks-durationβ Γ109 296
13%of cycles (time)
βin cycles
47/64
@kuksenk0 #JavaPerfAndHWC
Example 2: dTLB misses
Fixing it:
β Enable -XX:+UseLargePages
β try to shrink application working set
48/64
@kuksenk0 #JavaPerfAndHWC
Example 2: -XX:+UseLargePages
baseline large pages
IPC 0.54 0.64cycles Γ109 2277 2277instructions Γ109 1226 1460L1-dcache-loads Γ109 338 401L1-dcache-load-misses Γ109 38 38LLC-loads Γ109 25 25dTLB-loads Γ109 338 401dTLB-load-misses Γ109 6 0.24dTLB-walks-duration Γ109 296 2.6
Boost β 20%
49/64
@kuksenk0 #JavaPerfAndHWC
Example 2: normalized per transaction
baseline large pages
IPC 0.54 0.64cycles Γ106 23.4 19.5instructions Γ106 12.5 12.5L1-dcache-loads Γ106 3.45 3.45L1-dcache-load-misses Γ106 0.39 0.33LLC-loads Γ106 0.26 0.21dTLB-loads Γ106 3.45 3.45dTLB-load-misses Γ106 0.06 0.002dTLB-walks-duration Γ106 3.04 0.022
Boost
50/64
@kuksenk0 #JavaPerfAndHWC
Example 3
Β«A plague on both your housesΒ»
Β«++ on both your threadsΒ»
51/64
@kuksenk0 #JavaPerfAndHWC
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
int field0;
int field1;
}
@Benchmark
@Group("baseline")
public int justDoIt(StateBaseline s) {
return s.field0 ++;
}
@Benchmark
@Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
return s.field1 ++;
}
52/64
@kuksenk0 #JavaPerfAndHWC
Example 3: Measure
resourceβ sharing paddedSame Core (HT) L1 9.5 4.9Diff Cores (within socket) L3 (LLC) 10.6 2.8Diff Sockets nothing 18.2 2.8
average time, ns/op
βshared between threads
53/64
@kuksenk0 #JavaPerfAndHWC
Example 3: What do HWC tell us?
Same Core Diff Cores Diff Socketssharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4cycles 33130 17536 36012 9163 46484 9608instructions 26418 25865 26550 25747 26717 25768L1-loads 12593 9467 9696 8973 9672 9016L1-load-misses 10 5 12 4 33 3L1-stores 4317 7838 7433 4069 6935 4074L1-store-misses 5 2 161 2 55 1LLΠ‘-loads 4 3 58 1 32 1LLC-load-misses 1 1 53 β0 35 β0LLC-stores 1 1 183 β0 49 β0LLC-store-misses 1 β0 182 β0 48 β0
β All values are normalized per 103 operations
54/64
@kuksenk0 #JavaPerfAndHWC
Example 3: Β«on a core and a prayerΒ»
in case of the single core(HT) we have to look intoMACHINE_CLEARS.MEMORY_ORDERING
Same Core Diff Cores Diff Socketssharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4cycles 33130 17536 36012 9163 46484 9608instructions 26418 25865 26550 25747 26717 25768CLEARS 238 β0 β0 β0 β0 β0
55/64
@kuksenk0 #JavaPerfAndHWC
Example 3: Diff Cores
L2_STORE_LOCK_RQSTS - L2 RFOs breakdown
Diff Cores Diff Socketssharing padded sharing padded
CPI 1.4 0.4 1.7 0.4LLC-stores 183 β0 49 β0LLC-store-misses 182 β0 48 β0L2_STORE_LOCK_RQSTS.MISS 134 β0 33 β0L2_STORE_LOCK_RQSTS.HIT_E β0 β0 β0 β0L2_STORE_LOCK_RQSTS.HIT_M β0 β0 β0 β0L2_STORE_LOCK_RQSTS.ALL 183 β0 49 β0
56/64
@kuksenk0 #JavaPerfAndHWC
Example 3: Diff Cores
Diff Cores Diff Socketssharing padded sharing padded
CPI 1.4 0.4 1.7 0.4LLC-stores 183 β0 49 β0LLC-store-misses 182 β0 48 β0L2_STORE_LOCK_RQSTS.MISS 134 β0 33 β0L2_STORE_LOCK_RQSTS.ALL 183 β0 49 β0
Why 183 > 49 & 134 > 33,but the same socket case is faster?
58/64
@kuksenk0 #JavaPerfAndHWC
Example 3: Some events count duration
For example:
OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO
Diff Cores Diff Socketssharing padded sharing padded
CPI 1.4 0.4 1.7 0.4cycles 36012 9163 46484 9608instructions 26550 25747 26717 25768O_R_O.CYCLES_W_D_RFO 21723 10 29601 56
59/64
@kuksenk0 #JavaPerfAndHWC
Summary: Performance is easy
To achieve high performance:
β You have to know your Application!
β You have to know your Frameworks!
β You have to know your Virtual Machine!
β You have to know your Operating System!
β You have to know your Hardware!
61/64
@kuksenk0 #JavaPerfAndHWC
Summary: Β«Here be dragonsΒ»
To achieve high performance:
β You have to know your Application!
β You have to know your Frameworks!
β You have to know your Virtual Machine!
β You have to know your Operating System!
β You have to know your Hardware!
61/64
@kuksenk0 #JavaPerfAndHWC
Enlarge your knowledge with these simple tricks!
Reading list:β βComputer Architecture: A Quantitative ApproachβJohn L. Hennessy, David A. Patterson
β CPU vendors documentation
β http://www.agner.org/optimize/
β http://www.google.com/search?q=Hardware+
performance+counter
β etc. . .
62/64
Top Related