Download - Java Performance: Speedup your application with hardware counters

Transcript

@kuksenk0#JavaPerfAndHWC

Java Performance: Speedup your applications with hardware counters

Sergey Kuksenko@kuksenk0

@kuksenk0 #JavaPerfAndHWC

The following is intended to outline our general productdirection. It is intended for information purposes only, and maynot be incorporated into any contract. It is not a commitmentto deliver any material, code, or functionality, and should not berelied upon in making purchasing decisions. The development,release, and timing of any features or functionality described forOracle’s products remains at the sole discretion of Oracle.

2/64

@kuksenk0 #JavaPerfAndHWC

Intriguing Introductory Example

Gotcha, here is my hot methodor

On Algorithmic O⋆ptimizations

⋆just-O

3/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Codepublic Matrix multiply(Matrix other) {

int size = data.length;

Matrix result = new Matrix(size);

int [][] R = result.data;

int [][] A = this.data;

int [][] B = other.data;

for (int i = 0; i < size; i++) {

for (int j = 0; j < size; j++) {

int s = 0;

for (int k = 0; k < size; k++) {

s += A[i][k] * B[k][j];

}

R[i][j] = s;

}

}

return result;

}

𝑂(𝑁3)⋆

⋆big-O

4/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Codepublic Matrix multiply(Matrix other) {

int size = data.length;

Matrix result = new Matrix(size);

int [][] R = result.data;

int [][] A = this.data;

int [][] B = other.data;

for (int i = 0; i < size; i++) {

for (int j = 0; j < size; j++) {

int s = 0;

for (int k = 0; k < size; k++) {

s += A[i][k] * B[k][j];

}

R[i][j] = s;

}

}

return result;

}

𝑂(𝑁3)⋆

⋆big-O

4/64

@kuksenk0 #JavaPerfAndHWC

Example 1: "Ok Google"

𝑂(𝑁2.81)

5/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

N multiply strassen

32 33 πœ‡π‘ 64 252 πœ‡π‘ 128 3019 πœ‡π‘  2119 πœ‡π‘ 256 56 π‘šπ‘  15 π‘šπ‘ 512 488 π‘šπ‘  111 π‘šπ‘ 1024 3270 π‘šπ‘  785 π‘šπ‘ 2048 32 𝑠 6 𝑠4096 309 𝑠 42 𝑠8192 2611 𝑠 293 𝑠

time/op

6/64

@kuksenk0 #JavaPerfAndHWC

Try the other way

7/64

@kuksenk0 #JavaPerfAndHWC

Example 1: JMH, -prof perfnorm, N=256⋆

per multiplication per iterationCPI 1.06cycles 146Γ— 106 8.7instructions 137Γ— 106 8.2L1-loads 68Γ— 106 4L1-load-misses 33Γ— 106 2L1-stores 0.56Γ— 106

L1-store-misses 38357LLC-loads 26Γ— 106 1.6LLC-stores 11595

β‹†βˆΌ 256Kb per matrix

8/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

public Matrix multiply(Matrix other) {

int size = data.length;

Matrix result = new Matrix(size);

int [][] R = result.data;

int [][] A = this.data;

int [][] B = other.data;

for (int i = 0; i < size; i++) {

for (int j = 0; j < size; j++) {

int s = 0;

for (int k = 0; k < size; k++) {

s += A[i][k] * B[k][j] ;

}

R[i][j] = s;

}

}

return result;

}

L1-load-misses

9/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

public Matrix multiply(Matrix other) {

int size = data.length;

Matrix result = new Matrix(size);

int [][] R = result.data;

int [][] A = this.data;

int [][] B = other.data;

for (int i = 0; i < size; i++) {

for (int j = 0; j < size; j++) {

int s = 0;

for (int k = 0; k < size; k++) {

s += A[i][k] * B[k][j] ;

}

R[i][j] = s;

}

}

return result;

}

9/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

public Matrix multiplyIKJ(Matrix other) {

int size = data.length;

Matrix result = new Matrix(size);

int [][] R = result.data;

int [][] A = this.data;

int [][] B = other.data;

for (int i = 0; i < size; i++) {

for (int k = 0; k < size; k++) {

int aik = A[i][k];

for (int j = 0; j < size; j++) {

R[i][j] += aik * B[k][j];

}

}

}

return result;

}

10/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

N multIJK strassenIJK multIKJ

strassenIKJ

32 33 πœ‡π‘  13 πœ‡π‘ 64 252 πœ‡π‘  80 πœ‡π‘ 128 3019 πœ‡π‘  2119 πœ‡π‘  464 πœ‡π‘ 256 56 π‘šπ‘  15 π‘šπ‘  4 π‘šπ‘ 512 488 π‘šπ‘  111 π‘šπ‘  29 π‘šπ‘ 

25 π‘šπ‘ 

1024 3270 π‘šπ‘  785 π‘šπ‘  369 π‘šπ‘ 

192 π‘šπ‘ 

2048 32 𝑠 6 𝑠 3.4 𝑠

1.4 𝑠

4096 309 𝑠 42 𝑠 25 𝑠

10 𝑠

8192 2611 𝑠 293 𝑠 210 𝑠

73 𝑠

time/op

11/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Very Hot Code

N multIJK strassenIJK multIKJ strassenIKJ

32 33 πœ‡π‘  13 πœ‡π‘ 64 252 πœ‡π‘  80 πœ‡π‘ 128 3019 πœ‡π‘  2119 πœ‡π‘  464 πœ‡π‘ 256 56 π‘šπ‘  15 π‘šπ‘  4 π‘šπ‘ 512 488 π‘šπ‘  111 π‘šπ‘  29 π‘šπ‘  25 π‘šπ‘ 1024 3270 π‘šπ‘  785 π‘šπ‘  369 π‘šπ‘  192 π‘šπ‘ 2048 32 𝑠 6 𝑠 3.4 𝑠 1.4 𝑠4096 309 𝑠 42 𝑠 25 𝑠 10 𝑠8192 2611 𝑠 293 𝑠 210 𝑠 73 𝑠

time/op

11/64

@kuksenk0 #JavaPerfAndHWC

Example 1: JMH, -prof perfnorm, N=256

IJK IKJmultiply iteration multiply iteration

CPI 1.06 0.51cycles 146Γ— 106 8.7 9.7Γ— 106 0.6instructions 137Γ— 106 8.2 19Γ— 106 1.1L1-loads 68Γ— 106 4 5.4Γ— 106 0.3L1-load-misses 33Γ— 106 2 1.1Γ— 106 0.1L1-stores 0.56Γ— 106 2.7Γ— 106 0.2L1-store-misses 38357 8959LLC-loads 26Γ— 106 1.6 0.3Γ— 106

LLC-stores 11595 3532

12/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Free beer benefits!

cycles insts

...

6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore

9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc

0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx

0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge

...

Vectorization (SSE/AVX)!

13/64

@kuksenk0 #JavaPerfAndHWC

Example 1: Free beer benefits!

cycles insts

...

6.42% 5.54% 0x00007ff6591d20c5: vmovdqu %ymm1 ,0x10(%r14 ,%rbx ,4) ;* iastore

9.49% 11.91% 0x00007ff6591d20cc: add $0x8 ,%ebx ;*iinc

0.15% 0.05% 0x00007ff6591d20cf: cmp %r11d ,%ebx

0x00007ff6591d20d2: jl 0x00007ff6591d20b2 ;* if_icmpge

...

Vectorization (SSE/AVX)!

13/64

@kuksenk0 #JavaPerfAndHWC

Chapter 1

Performance Optimization Methodology:A Very Short Introduction.

Three magic questions and the direction of the

journey

14/64

@kuksenk0 #JavaPerfAndHWC

Three magic questions

What? β‡’ Where? β‡’ How?

βˆ™ What prevents my application to work faster?(monitoring)

βˆ™ Where does it hide?(profiling)

βˆ™ How to stop it messing with performance?(tuning/optimizing)

15/64

@kuksenk0 #JavaPerfAndHWC

Three magic questions

What?

β‡’ Where? β‡’ How?

βˆ™ What prevents my application to work faster?(monitoring)

βˆ™ Where does it hide?(profiling)

βˆ™ How to stop it messing with performance?(tuning/optimizing)

15/64

@kuksenk0 #JavaPerfAndHWC

Three magic questions

What? β‡’ Where?

β‡’ How?

βˆ™ What prevents my application to work faster?(monitoring)

βˆ™ Where does it hide?(profiling)

βˆ™ How to stop it messing with performance?(tuning/optimizing)

15/64

@kuksenk0 #JavaPerfAndHWC

Three magic questions

What? β‡’ Where? β‡’ How?

βˆ™ What prevents my application to work faster?(monitoring)

βˆ™ Where does it hide?(profiling)

βˆ™ How to stop it messing with performance?(tuning/optimizing)

15/64

@kuksenk0 #JavaPerfAndHWC

Top-Down Approach

βˆ™ System Level– Network, Disk, OS, CPU/Memory

βˆ™ JVM Level– GC/Heap, JIT, Classloading

βˆ™ Application Level– Algorithms, Synchronization, Threading, API

βˆ™ Microarchitecture Level– Caches, Data/Code alignment, CPU Pipeline Stalls

16/64

@kuksenk0 #JavaPerfAndHWC

Methodology in Essence

http://j.mp/PerfMindMap

17/64

@kuksenk0 #JavaPerfAndHWC

Methodology in Essence

http://j.mp/PerfMindMap

17/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Let’s go (Top-Down)

Do system monitoring (e.g. mpstat)

βˆ™ Lots of %sys β‡’ . . .⇓

βˆ™ Lots of %irq, %soft β‡’ . . .⇓

βˆ™ Lots of %iowait β‡’ . . .⇓

βˆ™ Lots of %idle β‡’ . . .⇓

βˆ™ Lots of %user

18/64

@kuksenk0 #JavaPerfAndHWC

Chapter 2

High CPU Load

and

the main question:

Β«who/what is to blame?Β»

19/64

@kuksenk0 #JavaPerfAndHWC

CPU Utilization

βˆ™ What does ∼100% CPU Utilization mean?

– OS has enough tasks to schedule

Can profiling help?

Profiler⋆ shows Β«WHEREΒ» application time is spent,but there’s no answer to the question Β«WHYΒ».

⋆traditional profiler

20/64

@kuksenk0 #JavaPerfAndHWC

CPU Utilization

βˆ™ What does ∼100% CPU Utilization mean?– OS has enough tasks to schedule

Can profiling help?

Profiler⋆ shows Β«WHEREΒ» application time is spent,but there’s no answer to the question Β«WHYΒ».

⋆traditional profiler

20/64

@kuksenk0 #JavaPerfAndHWC

CPU Utilization

βˆ™ What does ∼100% CPU Utilization mean?– OS has enough tasks to schedule

Can profiling help?

Profiler⋆ shows Β«WHEREΒ» application time is spent,but there’s no answer to the question Β«WHYΒ».

⋆traditional profiler

20/64

@kuksenk0 #JavaPerfAndHWC

CPU Utilization

βˆ™ What does ∼100% CPU Utilization mean?– OS has enough tasks to schedule

Can profiling help?

Profiler⋆ shows Β«WHEREΒ» application time is spent,but there’s no answer to the question Β«WHYΒ».

⋆traditional profiler

20/64

@kuksenk0 #JavaPerfAndHWC

Who/What is to blame?

Complex CPU microarchitecture:

βˆ™ Inefficient algorithm β‡’ 100% CPU

βˆ™ Pipeline stall due to memory load/stores β‡’ 100% CPU

βˆ™ Pipeline flush due to mispredicted branch β‡’ 100% CPU

βˆ™ Expensive instructions β‡’ 100% CPU

βˆ™ Insufficient ILP⋆ β‡’ 100% CPU

βˆ™ etc. β‡’ 100% CPU

⋆Instruction Level Parallelism

21/64

@kuksenk0 #JavaPerfAndHWC

Chapter 3

Hardware Counters

HWC, PMU - WTF?

22/64

@kuksenk0 #JavaPerfAndHWC

PMU: Performance Monitoring Unit

Performance Monitoring Unit - profiles hardware activity,built into CPU.

PMU Internals (in less than 21 seconds) :

Hardware counters (HWC) count hardware performance events(performance monitoring events)

23/64

@kuksenk0 #JavaPerfAndHWC

PMU: Performance Monitoring Unit

Performance Monitoring Unit - profiles hardware activity,built into CPU.

PMU Internals (in less than 21 seconds) :

Hardware counters (HWC) count hardware performance events(performance monitoring events)

23/64

@kuksenk0 #JavaPerfAndHWC

Events

Vendor’s documentation! (e.g. 2 pages from 32)

24/64

@kuksenk0 #JavaPerfAndHWC

Events: Issues

βˆ™ Hundreds of events

βˆ™ Microarchitectural experience is required

βˆ™ Platform dependent– vary from CPU vendor to vendor

– may vary when CPU manufacturer introduces newmicroarchitecture

βˆ™ How to work with HWC?

25/64

@kuksenk0 #JavaPerfAndHWC

HWC

HWC modes:

βˆ™ Counting mode– if (event_happened) ++counter;

– general monitoring (answering Β«WHAT?Β» )

βˆ™ Sampling mode– if (event_happened)

if (++counter < threshold) INTERRUPT;

– profiling (answering Β«WHERE?Β»)

26/64

@kuksenk0 #JavaPerfAndHWC

HWC: Issues

So many events, so few counters

e.g. β€œNehalem”:

– over 900 events

– 7 HWC (3 fixed + 4 programmable)

βˆ™ Β«multiple-runningΒ» (different events)– Repeatability

βˆ™ Β«multiplexingΒ» (only a few tools are able to do that)– Steady state

27/64

@kuksenk0 #JavaPerfAndHWC

HWC: Issues (cont.)

Sampling mode:

βˆ™ Β«instruction skidΒ»(hard to correlate event and instruction)

βˆ™ Uncore events(hard to bind event and execution thread;e.g. shared L3 cache)

28/64

@kuksenk0 #JavaPerfAndHWC

HWC: typical usages

βˆ™ hardware validation

βˆ™ performance analysis

βˆ™ run-time tuning (e.g. JRockit, etc.)

βˆ™ security attacks and defenses

βˆ™ test code coverage

βˆ™ etc.

29/64

@kuksenk0 #JavaPerfAndHWC

HWC: typical usages

βˆ™ hardware validation

βˆ™ performance analysis

βˆ™ run-time tuning (e.g. JRockit, etc.)

βˆ™ security attacks and defenses

βˆ™ test code coverage

βˆ™ etc.

29/64

@kuksenk0 #JavaPerfAndHWC

HWC: tools

βˆ™ Oracle Solaris Studio Performance Analyzer(http://www.oracle.com/technetwork/server-storage/solarisstudio)

βˆ™ perf/perf_events (http://perf.wiki.kernel.org)

βˆ™ JMH(http://openjdk.java.net/projects/code-tools/jmh/)

-prof perf

-prof perfnorm = perf, normalized per operation

-prof perfasm = perf + -XX:+PrintAssembly

30/64

@kuksenk0 #JavaPerfAndHWC

HWC: tools (cont.)

βˆ™ AMD CodeXL

βˆ™ Intel Vtune Amplifier XE

βˆ™ etc. . .

31/64

@kuksenk0 #JavaPerfAndHWC

perf events (e.g.)

βˆ™ cycles

βˆ™ instrustions

βˆ™ cache-references

βˆ™ cache-misses

βˆ™ branches

βˆ™ branch-misses

βˆ™ bus-cycles

βˆ™ ref-cycles

βˆ™ dTLB-loads

βˆ™ dTLB-load-misses

βˆ™ L1-dcache-loads

βˆ™ L1-dcache-load-misses

βˆ™ L1-dcache-stores

βˆ™ L1-dcache-store-misses

βˆ™ LLC-loads

βˆ™ etc...

32/64

@kuksenk0 #JavaPerfAndHWC

Oracle Studio events (e.g.)

βˆ™ cycles

βˆ™ insts

βˆ™ branch-instruction-retired

βˆ™ branch-misses-retired

βˆ™ dtlbm

βˆ™ l1h

βˆ™ l1m

βˆ™ l2h

βˆ™ l2m

βˆ™ l3h

βˆ™ l3m

βˆ™ etc...

33/64

@kuksenk0 #JavaPerfAndHWC

Chapter 4

I’ve got HWC data. What’s next?or

Introduction to microarchitecture

performance analysis

34/64

@kuksenk0 #JavaPerfAndHWC

Execution time

π‘‘π‘–π‘šπ‘’ = π‘π‘¦π‘π‘™π‘’π‘ π‘“π‘Ÿπ‘’π‘žπ‘’π‘’π‘›π‘π‘¦

Optimization is. . .

reducing the (spent) cycle count!⋆

⋆everything else is overclocking

35/64

@kuksenk0 #JavaPerfAndHWC

Microarchitecture Equation

𝑐𝑦𝑐𝑙𝑒𝑠 = π‘ƒπ‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃𝐼 = π‘ƒπ‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 1𝐼𝑃𝐢

βˆ™ PathLength - number of instructions

βˆ™ CPI - cycles per instruction

βˆ™ IPC - instructions per cycle

36/64

@kuksenk0 #JavaPerfAndHWC

π‘ƒπ‘Žπ‘‘β„ŽπΏπ‘’π‘›π‘”π‘‘β„Ž * 𝐢𝑃𝐼

βˆ™ PathLength ∼ algorithm efficiency (the smaller the better)

βˆ™ CPI ∼ CPU efficiency (the smaller the better)– 𝐢𝑃𝐼 = 4 – bad!

– 𝐢𝑃𝐼 = 1βˆ™ Nehalem – just good enough!

βˆ™ SandyBridge and later – not so good!

– 𝐢𝑃𝐼 = 0.4 – good!

– 𝐢𝑃𝐼 = 0.2 – ideal!

37/64

@kuksenk0 #JavaPerfAndHWC

What to do?

βˆ™ low CPI– reduce PathLength β†’ Β«tune algorithmΒ»

βˆ™ high CPI β‡’ stalls– memory stalls β†’ Β«tune data structuresΒ»

– branch stalls β†’ Β«tune control logicΒ»

– instruction dependency β†’ Β«break dependency chainsΒ»

– long latency ops β†’ Β«use more simple operationsΒ»

– etc. . .

38/64

@kuksenk0 #JavaPerfAndHWC

High CPI: Memory bound

βˆ™ dTLB misses

βˆ™ L1,L2,L3,...,LN misses

βˆ™ NUMA: non-local memory access

βˆ™ memory bandwidth

βˆ™ false/true sharing

βˆ™ cache line split (not in Java world, except. . . )

βˆ™ store forwarding (unlikely, hard to fix on Java level)

βˆ™ 4k aliasing

39/64

@kuksenk0 #JavaPerfAndHWC

High CPI: Core bound

βˆ™ long latency operations: DIV, SQRT

βˆ™ FP assist: floating points denormal, NaN, inf

βˆ™ bad speculation (caused by mispredicted branch)

βˆ™ port saturation (forget it)

40/64

@kuksenk0 #JavaPerfAndHWC

High CPI: Front-End bound

βˆ™ iTLB miss

βˆ™ iCache miss

βˆ™ branch mispredict

βˆ™ LSD (loop stream decoder)

solvable by HotSpot tweaking

41/64

@kuksenk0 #JavaPerfAndHWC @kuksenk0 #JavaPerfAndHWC

Exam

ples

@kuksenk0 #JavaPerfAndHWC

Example 2

Some Β«largeΒ» standard server-sideJava benchmark

43/64

@kuksenk0 #JavaPerfAndHWC

Example 2: perf stat -d

880851.993237 task -clock (msec)

39,318 context -switches

437 cpu -migrations

7,931 page -faults

2 ,277 ,063 ,113 ,376 cycles

1 ,226 ,299 ,634 ,877 instructions # 0.54 insns per cycle

229 ,500 ,265 ,931 branches # 260.544 M/sec

4 ,620 ,666 ,169 branch -misses # 2.01% of all branches

338 ,169 ,489 ,902 L1-dcache -loads # 383.912 M/sec

37 ,937 ,596 ,505 L1-dcache -load -misses # 11.22% of all L1-dcache hits

25 ,232 ,434 ,666 LLC -loads # 28.645 M/sec

7 ,307 ,884 ,874 L1-icache -load -misses # 0.00% of all L1 -icache hits

337 ,730 ,278 ,697 dTLB -loads # 382.846 M/sec

6 ,094 ,356 ,801 dTLB -load -misses # 1.81% of all dTLB cache hits

12 ,210 ,841 ,909 iTLB -loads # 13.863 M/sec

431 ,803 ,270 iTLB -load -misses # 3.54% of all iTLB cache hits

301.557213044 seconds time elapsed

44/64

@kuksenk0 #JavaPerfAndHWC

Example 2: Processed perf stat -d

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6

Let’s count:

4

2

Γ— (338Γ— 109)+12Γ— ((38βˆ’ 25)Γ— 109)+36Γ— (25Γ— 109) β‰ˆ2408Γ— 109 cycles ???

Potential Issues?

Supersc

alarin A

ction!

Issue!

45/64

@kuksenk0 #JavaPerfAndHWC

Example 2: Processed perf stat -d

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6

Let’s count:

4

2

Γ— (338Γ— 109)+12Γ— ((38βˆ’ 25)Γ— 109)+36Γ— (25Γ— 109) β‰ˆ2408Γ— 109 cycles ???

Potential Issues?

Supersc

alarin A

ction!

Issue!

45/64

@kuksenk0 #JavaPerfAndHWC

Example 2: Processed perf stat -d

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6

Let’s count:

4

2

Γ— (338Γ— 109)+12Γ— ((38βˆ’ 25)Γ— 109)+36Γ— (25Γ— 109) β‰ˆ2408Γ— 109 cycles ???

Potential Issues?

Supersc

alarin A

ction!

Issue!

45/64

@kuksenk0 #JavaPerfAndHWC

Example 2: Processed perf stat -d

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6

Let’s count:

4

2

Γ— (338Γ— 109)+12Γ— ((38βˆ’ 25)Γ— 109)+36Γ— (25Γ— 109) β‰ˆ2408Γ— 109 cycles ???

Potential Issues?

Supersc

alarin A

ction!

Issue!

45/64

@kuksenk0 #JavaPerfAndHWC

Example 2: Processed perf stat -d

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6

Let’s count:

4

2

Γ— (338Γ— 109)+12Γ— ((38βˆ’ 25)Γ— 109)+36Γ— (25Γ— 109) β‰ˆ2408Γ— 109 cycles ???

Potential Issues?

Supersc

alarin A

ction!

Issue!

45/64

@kuksenk0 #JavaPerfAndHWC

Example 2: dTLB misses

TLB = Translation Lookaside Buffer

βˆ™ a memory cache that stores recent translations of virtualmemory addresses to physical addresses

βˆ™ each memory access β†’ access to TLB

βˆ™ TLB miss may take hundreds cycles

How to check?

βˆ™ dtlb_load_misses_miss_causes_a_walk

βˆ™ dtlb_load_misses_walk_duration

46/64

@kuksenk0 #JavaPerfAndHWC

Example 2: dTLB misses

TLB = Translation Lookaside Buffer

βˆ™ a memory cache that stores recent translations of virtualmemory addresses to physical addresses

βˆ™ each memory access β†’ access to TLB

βˆ™ TLB miss may take hundreds cycles

How to check?

βˆ™ dtlb_load_misses_miss_causes_a_walk

βˆ™ dtlb_load_misses_walk_duration

46/64

@kuksenk0 #JavaPerfAndHWC

Example 2: dTLB misses

IPC 0.54cycles Γ—109 2277instructions Γ—109 1226L1-dcache-loads Γ—109 338L1-dcache-load-misses Γ—109 38LLC-loads Γ—109 25dTLB-loads Γ—109 338dTLB-load-misses Γ—109 6dTLB-walks-duration⋆ Γ—109 296

13%of cycles (time)

⋆in cycles

47/64

@kuksenk0 #JavaPerfAndHWC

Example 2: dTLB misses

Fixing it:

βˆ™ Enable -XX:+UseLargePages

βˆ™ try to shrink application working set

48/64

@kuksenk0 #JavaPerfAndHWC

Example 2: -XX:+UseLargePages

baseline large pages

IPC 0.54 0.64cycles Γ—109 2277 2277instructions Γ—109 1226 1460L1-dcache-loads Γ—109 338 401L1-dcache-load-misses Γ—109 38 38LLC-loads Γ—109 25 25dTLB-loads Γ—109 338 401dTLB-load-misses Γ—109 6 0.24dTLB-walks-duration Γ—109 296 2.6

Boost – 20%

49/64

@kuksenk0 #JavaPerfAndHWC

Example 2: normalized per transaction

baseline large pages

IPC 0.54 0.64cycles Γ—106 23.4 19.5instructions Γ—106 12.5 12.5L1-dcache-loads Γ—106 3.45 3.45L1-dcache-load-misses Γ—106 0.39 0.33LLC-loads Γ—106 0.26 0.21dTLB-loads Γ—106 3.45 3.45dTLB-load-misses Γ—106 0.06 0.002dTLB-walks-duration Γ—106 3.04 0.022

Boost

50/64

@kuksenk0 #JavaPerfAndHWC

Example 3

Β«A plague on both your housesΒ»

Β«++ on both your threadsΒ»

51/64

@kuksenk0 #JavaPerfAndHWC

Example 3: False Sharing

@State(Scope.Group)

public static class StateBaseline {

int field0;

int field1;

}

@Benchmark

@Group("baseline")

public int justDoIt(StateBaseline s) {

return s.field0 ++;

}

@Benchmark

@Group("baseline")

public int doItFromOtherThread(StateBaseline s) {

return s.field1 ++;

}

52/64

@kuksenk0 #JavaPerfAndHWC

Example 3: Measure

resource⋆ sharing paddedSame Core (HT) L1 9.5 4.9Diff Cores (within socket) L3 (LLC) 10.6 2.8Diff Sockets nothing 18.2 2.8

average time, ns/op

⋆shared between threads

53/64

@kuksenk0 #JavaPerfAndHWC

Example 3: What do HWC tell us?

Same Core Diff Cores Diff Socketssharing padded sharing padded sharing padded

CPI 1.3 0.7 1.4 0.4 1.7 0.4cycles 33130 17536 36012 9163 46484 9608instructions 26418 25865 26550 25747 26717 25768L1-loads 12593 9467 9696 8973 9672 9016L1-load-misses 10 5 12 4 33 3L1-stores 4317 7838 7433 4069 6935 4074L1-store-misses 5 2 161 2 55 1LLΠ‘-loads 4 3 58 1 32 1LLC-load-misses 1 1 53 β‰ˆ0 35 β‰ˆ0LLC-stores 1 1 183 β‰ˆ0 49 β‰ˆ0LLC-store-misses 1 β‰ˆ0 182 β‰ˆ0 48 β‰ˆ0

⋆ All values are normalized per 103 operations

54/64

@kuksenk0 #JavaPerfAndHWC

Example 3: Β«on a core and a prayerΒ»

in case of the single core(HT) we have to look intoMACHINE_CLEARS.MEMORY_ORDERING

Same Core Diff Cores Diff Socketssharing padded sharing padded sharing padded

CPI 1.3 0.7 1.4 0.4 1.7 0.4cycles 33130 17536 36012 9163 46484 9608instructions 26418 25865 26550 25747 26717 25768CLEARS 238 β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0

55/64

@kuksenk0 #JavaPerfAndHWC

Example 3: Diff Cores

L2_STORE_LOCK_RQSTS - L2 RFOs breakdown

Diff Cores Diff Socketssharing padded sharing padded

CPI 1.4 0.4 1.7 0.4LLC-stores 183 β‰ˆ0 49 β‰ˆ0LLC-store-misses 182 β‰ˆ0 48 β‰ˆ0L2_STORE_LOCK_RQSTS.MISS 134 β‰ˆ0 33 β‰ˆ0L2_STORE_LOCK_RQSTS.HIT_E β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0L2_STORE_LOCK_RQSTS.HIT_M β‰ˆ0 β‰ˆ0 β‰ˆ0 β‰ˆ0L2_STORE_LOCK_RQSTS.ALL 183 β‰ˆ0 49 β‰ˆ0

56/64

@kuksenk0 #JavaPerfAndHWC

Question!

To the audience

57/64

@kuksenk0 #JavaPerfAndHWC

Example 3: Diff Cores

Diff Cores Diff Socketssharing padded sharing padded

CPI 1.4 0.4 1.7 0.4LLC-stores 183 β‰ˆ0 49 β‰ˆ0LLC-store-misses 182 β‰ˆ0 48 β‰ˆ0L2_STORE_LOCK_RQSTS.MISS 134 β‰ˆ0 33 β‰ˆ0L2_STORE_LOCK_RQSTS.ALL 183 β‰ˆ0 49 β‰ˆ0

Why 183 > 49 & 134 > 33,but the same socket case is faster?

58/64

@kuksenk0 #JavaPerfAndHWC

Example 3: Some events count duration

For example:

OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO

Diff Cores Diff Socketssharing padded sharing padded

CPI 1.4 0.4 1.7 0.4cycles 36012 9163 46484 9608instructions 26550 25747 26717 25768O_R_O.CYCLES_W_D_RFO 21723 10 29601 56

59/64

@kuksenk0 #JavaPerfAndHWC @kuksenk0 #JavaPerfAndHWC

Sum

mar

y

@kuksenk0 #JavaPerfAndHWC

Summary: Performance is easy

To achieve high performance:

βˆ™ You have to know your Application!

βˆ™ You have to know your Frameworks!

βˆ™ You have to know your Virtual Machine!

βˆ™ You have to know your Operating System!

βˆ™ You have to know your Hardware!

61/64

@kuksenk0 #JavaPerfAndHWC

Summary: Β«Here be dragonsΒ»

To achieve high performance:

βˆ™ You have to know your Application!

βˆ™ You have to know your Frameworks!

βˆ™ You have to know your Virtual Machine!

βˆ™ You have to know your Operating System!

βˆ™ You have to know your Hardware!

61/64

@kuksenk0 #JavaPerfAndHWC

Enlarge your knowledge with these simple tricks!

Reading list:βˆ™ β€œComputer Architecture: A Quantitative Approach”John L. Hennessy, David A. Patterson

βˆ™ CPU vendors documentation

βˆ™ http://www.agner.org/optimize/

βˆ™ http://www.google.com/search?q=Hardware+

performance+counter

βˆ™ etc. . .

62/64

@kuksenk0 #JavaPerfAndHWC @kuksenk0 #JavaPerfAndHWC

Than

k

you!

@kuksenk0 #JavaPerfAndHWC @kuksenk0 #JavaPerfAndHWC

Q &

A