Applied mechanical sympathy - Simone Bordet - Codemotion Milan 2014


Who Am I

Simone Bordet [email protected] @simonebordet

Open Source Contributor: Jetty, CometD, MX4J, Foxtrot, LiveTribe, JBoss, Larex

Lead Architect at Intalio/Webtide
Jetty's SPDY and HTTP client maintainer

CometD project leader: web messaging framework

JVM tuning expert


Agenda

Mechanical Sympathy

Introduction to modern processors

Examples

Q&A


Mechanical Sympathy


Mechanical Sympathy

Term coined by Jackie Stewart (Formula 1 driver): he believed that to be a great driver one must have a mechanical sympathy for how a car works, to get the best out of it.

Term “revived” by Martin Thompson (of Disruptor fame), now applied to software (the driver) and hardware (the car).

Servers and libraries should be mechanically sympathetic, to exploit the full power of modern CPUs.


Introduction to Modern Processors


CPU Cache Latencies

L1 ~1 ns

L2 ~3 ns

L3 ~10 ns

RAM ~65 ns


CPU Execution

Modern CPUs are super-scalar: each core has multiple execution units, so independent operations can be executed in parallel in the same clock cycle by different execution units:

a = b + c
d = e + f

Modern CPUs perform out-of-order execution; semantic ordering is retained via instruction retirement:

a = b * c  => multiplication takes a few clocks
d = a + 1  => depends on a, must wait
e = b + 1  => can be executed before d


CPU Execution Quiz

int[] a = new int[2];

// Loop 1
for (int i = 0; i < bazillion; ++i) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < bazillion; ++i) { a[0]++; a[1]++; }

Loop 2 could be twice as fast: its two increments are independent and can execute in the same clock cycle, while Loop 1's increments depend on each other. This holds provided the JIT does not optimize Loop 1's body into a[0] += 2;
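To make the quiz concrete, here is a runnable sketch; the loop bound N stands in for “bazillion”, and the naive System.nanoTime() timing is only indicative, since real measurements belong in a harness like JMH:

```java
public class ExecutionQuiz
{
    static final int N = 10_000_000; // stands in for "bazillion"

    // Loop 1: both increments hit a[0], so each depends on the previous one.
    static int[] loop1()
    {
        int[] a = new int[2];
        for (int i = 0; i < N; ++i) { a[0]++; a[0]++; }
        return a;
    }

    // Loop 2: a[0]++ and a[1]++ are independent and can issue
    // on different execution units in the same clock cycle.
    static int[] loop2()
    {
        int[] a = new int[2];
        for (int i = 0; i < N; ++i) { a[0]++; a[1]++; }
        return a;
    }

    public static void main(String[] args)
    {
        long t0 = System.nanoTime();
        int[] r1 = loop1();
        long t1 = System.nanoTime();
        int[] r2 = loop2();
        long t2 = System.nanoTime();
        System.out.println("loop1: " + (t1 - t0) / 1_000_000 + " ms, a = [" + r1[0] + ", " + r1[1] + "]");
        System.out.println("loop2: " + (t2 - t1) / 1_000_000 + " ms, a = [" + r2[0] + ", " + r2[1] + "]");
    }
}
```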


CPU Instruction Rate

CPUs are fast: 3 GHz => 3 clock cycles / ns

CPUs are parallel: 4x (or more) instructions per clock cycle

Potential instruction rate: 3 x 4 = 12 instructions / ns

A cache miss to main memory costs 12 x 65 = 780 instructions!

Compilers and JITs help with reordering, but developers must pay attention to memory access patterns.


CPU Access Patterns

CPUs are good at guessing what is accessed next: instructions and data are prefetched into the caches.

CPUs access memory in “cache lines”: 64 bytes of contiguous data.

Different cores can access the same cache line; the MESI protocol keeps the data consistent.

MESI – Modified / Exclusive / Shared / Invalid


CPU Access Pattern Quiz

int[] arr = new int[bazillion];

// Loop 1
for (int i = 0; i < arr.length; ++i) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.length; i += 16) arr[i] *= 3;

Loop 2 does only 1/16 (6.25%) of the work, yet has the same performance: both loops are dominated by memory access, not computation. With a stride of 16 ints (64 bytes), each iteration still touches a new cache line.
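Another way to see that memory access patterns dominate, not from the slides but in the same spirit: traverse a 2D array row by row (sequential cache lines, prefetcher-friendly) versus column by column (a distant cache line on nearly every access). The size below is illustrative:

```java
public class AccessPatterns
{
    static final int N = 2048;

    // Row-major: consecutive ints, sequential cache lines, prefetcher-friendly.
    static long sumRowMajor(int[][] m)
    {
        long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += m[i][j];
        return sum;
    }

    // Column-major: each access lands a whole row (N * 4 bytes) away
    // from the previous one, defeating prefetch.
    static long sumColMajor(int[][] m)
    {
        long sum = 0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                sum += m[i][j];
        return sum;
    }

    public static void main(String[] args)
    {
        int[][] m = new int[N][N];
        for (int i = 0; i < N; ++i)
            java.util.Arrays.fill(m[i], 1);
        System.out.println(sumRowMajor(m)); // same result...
        System.out.println(sumColMajor(m)); // ...different memory traffic
    }
}
```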


CPU Branch Prediction

The CPU branch predictor improves instruction flow by keeping track of branch history.

The outcome of an if/else branch is guessed, and the branch is speculatively executed.

Guess is right => instructions are retired

Guess is wrong => 3-6 ns penalty + re-execution
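The classic way to observe this, sketched here as an illustration (not from the slides): run the same data-dependent branch over unsorted and sorted data. Sorting makes the branch history predictable, so only the miss rate changes, never the result:

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo
{
    // Sums the elements >= 128; on random data in [0, 256) the branch
    // is taken ~50% of the time, unpredictably. Once the data is sorted,
    // the branch predictor is almost always right.
    static long sumAbove(int[] data)
    {
        long sum = 0;
        for (int v : data)
            if (v >= 128)
                sum += v;
        return sum;
    }

    public static void main(String[] args)
    {
        int[] data = new int[1 << 20];
        Random rnd = new Random(42); // fixed seed, reproducible
        for (int i = 0; i < data.length; ++i)
            data[i] = rnd.nextInt(256);

        long unsorted = sumAbove(data); // many branch misses
        Arrays.sort(data);
        long sorted = sumAbove(data);   // almost no branch misses

        // Same work, same answer; only the branch-miss rate differs.
        System.out.println(unsorted == sorted); // prints true
    }
}
```

Timing the two calls (ideally under JMH, or under perf stat -e branch-misses) shows the sorted pass running markedly faster on typical hardware.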


Tools

Java Microbenchmark Harness – JMH
http://openjdk.java.net/projects/code-tools/jmh/
Vastly superior to any other framework; probably the only non-flawed solution for HotSpot

Linux Perf Tools:

perf list
Shows hardware events

perf stat
Attaches to a running process to gather events


Real code examples from Jetty 9.x


HEX Conversion

Branching code

String hex = "A6";
int value = (x2i(hex.charAt(0)) << 4) | x2i(hex.charAt(1));

public int x2i(char c)
{
    if (c >= 'A' && c <= 'F')
        return 10 + c - 'A';
    else
        return c - '0';
}


HEX Conversion

Branchless code

String hex = "A6";
int value = (x2i(hex.charAt(0)) << 4) | x2i(hex.charAt(1));

public int x2i(char c)
{
    return (c & 0x1F) + ((c >> 6) * 0x19) - 0x10;
}
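The branchless formula can be checked exhaustively against the branching version for the characters it is meant to handle ('0'-'9' and uppercase 'A'-'F'); a quick self-contained check:

```java
public class HexCheck
{
    // Branching version, as on the previous slide.
    static int x2iBranching(char c)
    {
        if (c >= 'A' && c <= 'F')
            return 10 + c - 'A';
        else
            return c - '0';
    }

    // Branchless version: (c >> 6) is 0 for '0'-'9' (0x30-0x39) and
    // 1 for 'A'-'F' (0x41-0x46), so the 0x19 correction applies only to letters.
    static int x2iBranchless(char c)
    {
        return (c & 0x1F) + ((c >> 6) * 0x19) - 0x10;
    }

    public static void main(String[] args)
    {
        for (char c : "0123456789ABCDEF".toCharArray())
            if (x2iBranching(c) != x2iBranchless(c))
                throw new AssertionError("mismatch at " + c);
        System.out.println("ok"); // prints ok
    }
}
```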


HEX Conversion Benchmark

JMH benchmark runs (bigger is better):
Branching: 270.686 ops/ms
Branchless: 701.206 ops/ms

perf stat -e branches -e branch-misses

Branching:
1.27 insns/cycle, 13,353,788,164 branches, 11.92% branch-misses

Branchless:
2.31 insns/cycle, 2,553,447,200 branches, 0.02% branch-misses


False Sharing

Jetty uses a customized queue for its thread pool

public class BlockingArrayQueue
{
    private int _head;
    private int _tail;
}

Memory layout:

header (8 bytes) | _head (4 bytes) | _tail (4 bytes)

With a 64-byte cache line, _head and _tail end up on the same line.


False Sharing

[Diagram: _head (h) and _tail (t) sit on the same cache line, copied into the L1 and L2 caches of Core 0 and Core 1 and into the shared L3; a write by either core invalidates the other core's copy of the whole line.]


False Sharing

Avoid false sharing by padding

public class BlockingArrayQueue
{
    private int _head;
    private int h01, h02, h03, h04, h05, h06, h07, h08;
    private int h11, h12, h13, h14, h15, h16, h17, h18;
    private int _tail;
    private int t01, t02, t03, t04, t05, t06, t07, t08;
    private int t11, t12, t13, t14, t15, t16, t17, t18;
}
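The layout difference can be sketched with two threads hammering the two fields; class and field names below are illustrative, and the measured gap depends heavily on the machine:

```java
public class FalseSharingDemo
{
    static class Dense
    {
        volatile long head; // head and tail share a 64-byte cache line
        volatile long tail;
    }

    static class Padded
    {
        volatile long head;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
        volatile long tail;              // now on a different cache line
    }

    static final long ITERATIONS = 10_000_000L;

    // One thread bumps head, the other bumps tail; with the dense layout
    // every write invalidates the other core's copy of the shared line.
    static long run(Runnable headTask, Runnable tailTask) throws InterruptedException
    {
        Thread t1 = new Thread(headTask);
        Thread t2 = new Thread(tailTask);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException
    {
        Dense d = new Dense();
        long denseNanos = run(
            () -> { for (long i = 0; i < ITERATIONS; ++i) d.head++; },
            () -> { for (long i = 0; i < ITERATIONS; ++i) d.tail++; });

        Padded p = new Padded();
        long paddedNanos = run(
            () -> { for (long i = 0; i < ITERATIONS; ++i) p.head++; },
            () -> { for (long i = 0; i < ITERATIONS; ++i) p.tail++; });

        System.out.println("dense:  " + denseNanos / 1_000_000 + " ms");
        System.out.println("padded: " + paddedNanos / 1_000_000 + " ms");
    }
}
```

Note the usual caveat: the JIT may eliminate or reorder padding fields, which is exactly why JDK 8's @Contended annotation exists.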


False Sharing Benchmark

JMH benchmark runs (bigger is better):
Dense: 624,644.870 ops/ms
Padded: 790,501.896 ops/ms
Improvement: 26%

See also j.u.c.LinkedBlockingQueue.

JDK 8 adds the sun.misc.Contended annotation, so fields can be padded without declaring the padding by hand.


Quiz

SLF4J Logger.debug(String, Object...)

int read = socket.read(buffer);
logger.debug("Bytes read {}", read);

Hidden costs, even when debug logging is disabled:
Vararg array creation
Boxing of the read variable; the boxed values are likely not in the Integer cache
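One common mitigation is to guard the call. The sketch below uses a hypothetical stub logger, not the real SLF4J API, just to count how many vararg calls (and hence Object[] allocations and boxings at the call sites) actually happen:

```java
public class LoggingCost
{
    // Hypothetical stand-in for an SLF4J-style logger.
    static class StubLogger
    {
        boolean debugEnabled = false;
        int callCount = 0;

        // The Object... parameter forces each caller to allocate an Object[]
        // and box primitives, even when debug logging is disabled.
        void debug(String format, Object... args)
        {
            callCount++;
            if (debugEnabled)
                System.out.println(format + " " + java.util.Arrays.toString(args));
        }

        boolean isDebugEnabled() { return debugEnabled; }
    }

    public static void main(String[] args)
    {
        StubLogger logger = new StubLogger();
        int read = 42;

        // Unguarded: allocates an Object[1] and an Integer even though debug is off.
        logger.debug("Bytes read {}", read);

        // Guarded: with debug off, no call and therefore no allocation.
        if (logger.isDebugEnabled())
            logger.debug("Bytes read {}", read);

        System.out.println(logger.callCount); // prints 1
    }
}
```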


HTTP Processing

Pipeline: IO Select -> Read Bytes -> Parse Bytes -> Invoke Servlet


HTTP Processing

Jetty 8: select + dispatch + process + invoke

Jetty 8: select + dispatch + process + invoke

[Diagram: Selector Thread performs IO Select, then dispatches; a Worker Thread reads bytes, parses bytes, and invokes the servlet.]


HTTP Processing

Jetty 9.0.0.M3: select + process + dispatch + invoke

Jetty 9.0.0.M3: select + process + dispatch + invoke

[Diagram: Selector Thread performs IO Select, reads bytes and parses bytes; a Worker Thread invokes the servlet.]


HTTP Processing Benchmark

Pipeline benchmark, 1 M requests:
Jetty 8: 37.696 s
Jetty 9.M3: 47.212 s

Even though Jetty 9.0.0.M3 was faster in each step, the overall performance was horrible: parallel slowdown.

From Jetty 9.0.0.M4 onward we switched back to the select + dispatch + process + invoke model.


HTTP Processing Benchmark

Pipeline benchmark, 1 M requests:
Jetty 8: 37.696 s
Jetty 9.M4+: 29.172 s

Jetty 9 is way faster than Jetty 8, and produces less garbage too.


JDK 7 NIO2 AsynchronousSocketChannel

With JDK 7's NIO2 AsynchronousSocketChannel, I/O operations and CompletionHandler notifications only happen in the “channel group” thread.

Two problems:
Unnecessary context switch for I/O operations
Stack overflow for I/O operations performed by completion handlers, worked around by an additional context switch after N recursions


Atomic State Machines

Jetty 9: no synchronized blocks in the core; all replaced by atomic state machines.

No StackOverflowError typical of asynchronous callbacks, and no context switch to a new thread.

public abstract class IteratingCallback implements Callback
{
    protected enum State { IDLE, PROCESSING, PENDING, ... };

    private final AtomicReference<State> _state =
        new AtomicReference<>(State.IDLE);
}


Atomic State Machines

public void succeeded()
{
    loop: while (true)
    {
        State current = _state.get();
        switch (current)
        {
            case PENDING:
                if (_state.compareAndSet(current, State.PROCESSING))
                    break loop;
                continue;
            ...
        }
    }
}
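A stripped-down, runnable version of the same CAS-loop pattern (state names follow the slide; the surrounding processing logic is a placeholder, not Jetty's actual IteratingCallback):

```java
import java.util.concurrent.atomic.AtomicReference;

public class AtomicStateMachine
{
    enum State { IDLE, PROCESSING, PENDING }

    private final AtomicReference<State> _state =
        new AtomicReference<>(State.IDLE);

    // CAS loop: retry until the transition wins or the state says stop.
    public boolean succeeded()
    {
        while (true)
        {
            State current = _state.get();
            switch (current)
            {
                case PENDING:
                    // Only one caller wins the PENDING -> PROCESSING transition.
                    if (_state.compareAndSet(current, State.PROCESSING))
                        return true;
                    continue; // lost the race: re-read the state and retry
                default:
                    return false; // nothing to do in IDLE or PROCESSING
            }
        }
    }

    public State state() { return _state.get(); }
    public void pend() { _state.set(State.PENDING); }

    public static void main(String[] args)
    {
        AtomicStateMachine m = new AtomicStateMachine();
        System.out.println(m.succeeded()); // prints false: state is IDLE
        m.pend();
        System.out.println(m.succeeded()); // prints true: PENDING -> PROCESSING
        System.out.println(m.state());     // prints PROCESSING
    }
}
```

Because the losing thread simply re-reads the state and retries, there is no lock to block on and no recursive callback to overflow the stack.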


Conclusions

Know your hardware: you will become a better programmer.

Do not try this at home: these optimizations only matter in the right context. Do ask library writers to optimize, though!

Use Jetty 9.x: we do care about these optimizations.


Questions & Answers