
Java in High-Performance Computing

Dawid Weiss

Carrot Search
Institute of Computing Science, Poznan University of Technology

GeeCon Poznan, 05/2010

Learn from the mistakes of others. You can’t live long enough to make them all yourself.

— Eleanor Roosevelt

Talk outline

• What is “High performance”?

• What is “Java”?

• Measuring performance (benchmarking).

• HPPC library.

Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.


Divide-and-conquer style algorithm

for (Example e : examples) {
  if (e.hasQuiz()) e.showQuiz(); else e.showCode();
  e.explain();
  e.deriveConclusions();
}

— PART I —

High Performance Computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

— Wikipedia

Is Java faster than C/C++?
The short answer is: it depends.

— Cliff Click

It’s usually hard to make a fast program run faster.

It’s easy to make a slow program run even slower.

It’s easy to make fast hardware run slow.


For now, HPC

• limited allowed computation time,

• constrained resources (hardware, memory).

Good HPC software ∝ no (obvious) flaws.


— PART II —

What is Java?

(Recall: Is Java faster than C/C++?)

Example 1

public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum2(i, i);
  result = sum;
}

where the bodies of sum1 and sum2 sum their arguments and return the result, and COUNT is significantly large...


VM               sum1  sum2
sun-1.6.0-20     0.04  2.62
sun-1.6.0-16     0.04  3.20
sun-1.5.0-18     0.04  3.29
ibm-1.6.2        0.08  6.28
jrockit-27.5.0   0.18  0.16
harmony-r917296  0.17  0.35

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).


VM               sum1  sum2  sum3  sum4
sun-1.6.0-20     0.04  2.62  1.05   3.76
sun-1.6.0-16     0.04  3.20  1.39   4.99
sun-1.5.0-18     0.04  3.29  1.46   5.20
ibm-1.6.2        0.08  6.28  0.16  14.64
jrockit-27.5.0   0.18  0.16  1.16   3.18
harmony-r917296  0.17  0.35  9.18  22.49

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

int sum1(int a, int b) {
  return a + b;
}

Integer sum2(Integer a, Integer b) {
  return a + b;
}

// sum2 as the compiler sees it (autoboxing written out):
Integer sum2(Integer a, Integer b) {
  return Integer.valueOf(
    a.intValue() + b.intValue());
}

int sum3(int... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++)
    sum += args[i];
  return sum;
}

Integer sum4(Integer... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++) {
    sum += args[i];
  }
  return sum;
}

// sum4 as the compiler sees it (varargs are a plain array):
Integer sum4(Integer [] args) {
  // ...
}

Conclusions

• Syntactic sugar may be costly.

• Primitive types are fast.

• Large differences between different VMs.
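The first two conclusions can be tried out with a small self-contained sketch. The class name `BoxingSketch` and the loop bound are assumptions made for illustration; this is not the benchmark code used in the talk, and it only checks that both variants compute the same value, not how long they take.

```java
// Hypothetical illustration; not the benchmark code from the talk.
final class BoxingSketch {
    // Primitive version: plain integer addition, no objects involved.
    static int sum1(int a, int b) {
        return a + b;
    }

    // Boxed version: each call unboxes both arguments and boxes the result.
    static Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    static int runPrimitive(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum += sum1(i, i);
        }
        return sum;
    }

    static int runBoxed(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            // i is boxed twice on the way in; the result is unboxed on the way out.
            sum += sum2(i, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int count = 1_000_000;
        // Same arithmetic, so both variants agree (including int overflow).
        System.out.println(runPrimitive(count) == runBoxed(count));
    }
}
```

Timing the two `run...` methods on a given VM is what the table above reports; the ratio is VM-specific, as the JRockit row shows.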

Example 2

Write once, run anywhere!

But it’s the same VM!

It works on my machine!

private static boolean ready;

public static void startThread() {
  new Thread() {
    public void run() {
      try {
        sleep(2000);
      } catch (Exception e) { /* ignore */ }
      System.out.println("Marking loop exit.");
      ready = true;
    }
  }.start();
}

public static void main(String[] args) {
  startThread();
  System.out.println("Entering the loop...");
  while (!ready) {
    // Do nothing.
  }
  System.out.println("Done, I left the loop!");
}

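while (!ready) {
  // Do nothing.
}

≡?

boolean r = ready;
while (!r) {
  // Do nothing.
}

In most cases true, from a JMM (Java Memory Model) perspective: ready is not volatile, so the compiler is free to read it once and hoist the check out of the loop.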
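One common way to make the loop terminate reliably is to declare the flag `volatile`, which rules out the hoisting shown above because every iteration must perform a fresh read. The sketch below is an assumption about how the example could be repaired, not code from the talk; the class name `VolatileFlagSketch`, the shortened delay and the daemon thread are illustrative choices.

```java
// Hypothetical repair of the example: the flag is declared volatile.
final class VolatileFlagSketch {
    private static volatile boolean ready;

    static void startThread(final long delayMillis) {
        Thread t = new Thread() {
            @Override
            public void run() {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                ready = true; // volatile write: visible to the spinning thread
            }
        };
        t.setDaemon(true);
        t.start();
    }

    // Spins until the flag is set; returns true once the loop has been left.
    static boolean waitUntilReady() {
        while (!ready) {
            // Busy-wait; each iteration performs a fresh volatile read.
        }
        return true;
    }

    public static void main(String[] args) {
        startThread(200);
        System.out.println("Entering the loop...");
        waitUntilReady();
        System.out.println("Done, I left the loop!");
    }
}
```

Busy-waiting is kept only to mirror the original; in production code a `CountDownLatch` or similar would usually be preferable.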


JVM Internals. . .

C1:

• fast

• not (much) optimization

C2:

• slow(er) than C1

• a lot of JMM-allowed optimizations

There are hundreds of JVM tuning/diagnostic switches.

My personal favorite:

Conclusions

• Bytecode is far from what is executed.

• A lot going on under the (VM) hood.

• Bad code may work, but will eventually crash.

• HotSpot-level optimizations are good.

• If there is a bug in the HotSpot compiler. . .


Any other diversifying factors?

J2ME

• more VM vendors,

• hardware diversity,

• software and hardware quirks.

Non-JVM target platforms

• Dalvik

• GWT

• IKVM

Conclusions

• There is no “single” Java performance model.

• Performance depends on the VM, environment, class library, hardware.

• Apply benchmark-and-correct cycle.

Benchmarking

Example 3

public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

public void testSum1_2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
}

VM               sum1  sum1_2
sun-1.6.0-20     0.04  0.00
sun-1.6.0-16     0.04  0.00
sun-1.5.0-18     0.04  0.00
ibm-1.6.2        0.08  0.01
jrockit-27.5.0   0.17  0.08
harmony-r917296  0.17  0.11

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).


java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...

- method holder: 'com/dawidweiss/geecon2010/Example03'
- access: 0xc1000001 public
- name: 'testSum1_2'
...
010 pushq rbp
    subq rsp, #16 # Create frame
    nop # nop for patch_verified_entry
016 addq rsp, 16 # Destroy frame
    popq rbp
    testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
021 ret


Conclusions

• Benchmarks must be executed to provide feedback.

• HotSpot is smart and effective at removing dead code.

Example 4

@Test
public void testAdd1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++) {
    sum += add1(i);
  }
  guard = sum;
}

public int add1(int i) {
  return i + 1;
}

Note add1 is virtual.

switch                              testAdd1
-XX:+Inlining -XX:+PrintInlining    0.04
-XX:-Inlining                       0.45

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200, JRE 1.7b80-debug).

Most Java calls are monomorphic.

HotSpot adjusts to megamorphic calls automatically.

Example 5

abstract class Superclass {
  abstract int call();
}

class Sub1 extends Superclass { int call() { return 1; } }

class Sub2 extends Superclass { int call() { return 2; } }

class Sub3 extends Superclass { int call() { return 3; } }

Superclass[] mixed = initWithRandomInstances(10000);

Superclass[] solid = initWithSub1Instances(10000);

@Test
public void testMonomorphic() {
  int sum = 0;
  int m = solid.length;
  for (int i = 0; i < COUNT; i++)
    sum += solid[i % m].call();
  guard = sum;
}

@Test
public void testMegamorphic() {
  int sum = 0;
  int m = mixed.length;
  for (int i = 0; i < COUNT; i++)
    sum += mixed[i % m].call();
  guard = sum;
}

VM               monomorphic  megamorphic
sun-1.6.0-20     0.19         0.32
sun-1.6.0-16     0.19         0.34
sun-1.5.0-18     0.18         0.34
ibm-1.6.2        0.20         0.30
jrockit-27.5.0   0.22         0.29
harmony-r917296  0.27         0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

Example 6

@Test
public void testBitCount1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += Integer.bitCount(i);
  guard = sum;
}

@Test
public void testBitCount2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += bitCount(i);
  guard = sum;
}

/* Copied from
 * {@link Integer#bitCount}
 */
static int bitCount(int i) {
  // HD, Figure 5-2
  i = i - ((i >>> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
  i = (i + (i >>> 4)) & 0x0f0f0f0f;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  return i & 0x3f;
}

VM             testBitCount1  testBitCount2
sun-1.6.0-20   0.43           0.43
sun-1.7.0-b80  0.43           0.43

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).

VM             testBitCount1  testBitCount2
sun-1.6.0-20   0.08           0.33
sun-1.7.0-b83  0.07           0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Windows 7, Intel I7 860).


... -XX:+PrintInlining ...

...
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Example06.testBitCount1: [measured 10 out of 15 rounds]
round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00] ...

@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
Example06.testBitCount2: [measured 10 out of 15 rounds]
round: 0.32 [+- 0.01], round.gc: 0.00 [+- 0.00] ...


... -XX:+PrintOptoAssembly ...

{method}
- klass: {other class}
- method holder: com/dawidweiss/geecon2010/Example06
- name: testBitCount1
...
0c2 B13: # B12 B14 <- B8 B12 Loop: B13-B12 inner stride: ...
0c2 movl R10, RDX # spill
...
0e1 movl [rsp + #40], R11 # spill
0e6 popcnt R8, R8
...
0f5 addl R9, #7 # int
0f9 popcnt R11, R11
0fe popcnt RCX, R9


Conclusions

• Benchmarks must be statistically sound.
  → averages, variance, min, max, warm-up phase

• Account for HotSpot optimisations.

• Account for hardware differences.
  → test-on-target

• Use domain data and real scenarios.

• Inspect suspicious output with debug JVM.
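The checklist above can be sketched as a tiny hand-rolled harness: warm-up rounds that are discarded, measured rounds that are timed, summary statistics, and a sink field so the measured loop is not dead code. This is an illustration under assumptions (the class `BenchmarkSketch` and all method names are made up here); it is not the junit-benchmarks API used for the numbers in this talk.

```java
// Hypothetical minimal harness; names are illustrative only.
final class BenchmarkSketch {
    /** Sink for results so the JIT cannot discard the measured loop as dead code. */
    static volatile int guard;

    /** Runs warm-up rounds (discarded), then returns per-round times in seconds. */
    static double[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run();
        }
        double[] seconds = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    /** Population standard deviation of the measured rounds. */
    static double stdDev(double[] values) {
        double m = mean(values);
        double squares = 0;
        for (double v : values) squares += (v - m) * (v - m);
        return Math.sqrt(squares / values.length);
    }

    static double min(double[] values) {
        double best = values[0];
        for (double v : values) best = Math.min(best, v);
        return best;
    }

    static double max(double[] values) {
        double worst = values[0];
        for (double v : values) worst = Math.max(worst, v);
        return worst;
    }

    public static void main(String[] args) {
        // 5 warm-up and 10 measured rounds, mirroring the setup quoted in the tables.
        double[] rounds = measure(() -> {
            int sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i + i;
            guard = sum; // publish the result, as in testSum1
        }, 5, 10);
        System.out.printf("avg %.4f s, stddev %.4f, min %.4f, max %.4f%n",
            mean(rounds), stdDev(rounds), min(rounds), max(rounds));
    }
}
```

A sketch like this still misses many pitfalls (on-stack replacement, GC pauses, frequency scaling), which is a reason to prefer an established benchmarking tool for real measurements.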

See more: Cliff Click, http://java.sun.com/javaone/2009/articles/rockstar_click.jsp.

HPPC
High Performance Primitive Collections

Motivation

• Primitive types: fast and memory-friendly.

• Optional assertions.

• Single-threaded. No fail-fast.

• Fast, fast, fast iterators, with no GC overhead.

• Open internals (explicit implementation).

• Programmers know what they’re doing.

Why not JCF?

public interface List<E> extends Collection<E> {
  boolean contains(Object o); // [-] contract-enforced methods
  Iterator<E> iterator();     // [-] iterators over primitive types?
  Object[] toArray();         // [-] troublesome covariants
  ...

Friendly Competition

• fastutil

• PCJ

• GNU Trove

• Apache Mahout (ported COLT)

• Apache Primitive Collections

All of these have pros and cons and deal with JCF compatibility somehow.

Iterators in fastutil or PCJ

interface IntIterator extends Iterator<Integer> {
  // Primitive-specific method
  int nextInt();
}

Iterators in HPPC

public final class IntCursor {
  public int index;
  public int value;
}

public class IntArrayList implements Iterable<IntCursor> {
  public Iterator<IntCursor> iterator() { ... }
}
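One plausible way a cursor-based iterator avoids per-element garbage is to allocate a single cursor per iteration and overwrite its fields on every step. The sketch below illustrates that idea only; `CursorSketch` and its nested `IntList` are made-up names and this is not HPPC's actual implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of a reusable cursor; not HPPC source code.
final class CursorSketch {
    static final class IntCursor {
        int index;
        int value;
    }

    static final class IntList implements Iterable<IntCursor> {
        private final int[] buffer;
        private final int size;

        IntList(int... values) {
            this.buffer = values.clone();
            this.size = values.length;
        }

        @Override
        public Iterator<IntCursor> iterator() {
            return new Iterator<IntCursor>() {
                // One cursor per iteration, reused for every element.
                private final IntCursor cursor = new IntCursor();
                private int position = 0;

                @Override
                public boolean hasNext() {
                    return position < size;
                }

                @Override
                public IntCursor next() {
                    if (position >= size) throw new NoSuchElementException();
                    cursor.index = position;
                    cursor.value = buffer[position];
                    position++;
                    return cursor; // same object every time, no boxing
                }

                @Override
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }

    static long sum(IntList list) {
        long total = 0;
        for (IntCursor c : list) {
            total += c.value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new IntList(1, 2, 3)));
    }
}
```

The trade-off is that a cursor must not be stored across iterations, since its contents change on the next call to `next()`.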

Iterating over list elements in HPPC

for (IntCursor c : list) {
  System.out.println(c.index + ": " + c.value);
}

...or

list.forEach(new IntProcedure() {
  public void apply(int value) {
    System.out.println(value);
  }
});

...or

final int [] buffer = list.buffer;
final int size = list.size();

for (int i = 0; i < size; i++) {
  System.out.println(i + ": " + buffer[i]);
}


The fastest one?

What’s in HPPC?

Open implementation is good.

/**
 * Applies a supplemental hash function to a given
 * hashCode, which defends against poor quality
 * hash functions. [...]
 */
static int hash(int h) {
  // This function ensures that hashCodes that differ only by
  // constant multiples at each bit position have a bounded
  // number of collisions (approximately 8 at default load factor).
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}

HashMap rehashes your (carefully crafted) hash code.

HPPC approach (example):

public class LongIntOpenHashMap implements LongIntMap {
  // ...
  public LongIntOpenHashMap(int initialCapacity, float loadFactor,
      LongHashFunction keyHashFunction, IntHashFunction valueHashFunction) {
    // ...
  }

Defaults: LongMurmurHash, IntHashFunction.
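Why the choice of hash function matters can be shown with a small sketch that counts how many distinct slots a set of keys occupies in a power-of-two table. The `IntHash` interface and the class `HashSketch` are hypothetical names introduced for this illustration (they are not HPPC types); the `SUPPLEMENTAL` function is the `HashMap` rehash quoted earlier.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a pluggable hash function.
final class HashSketch {
    /** Made-up interface for this sketch; not HPPC's actual API. */
    interface IntHash {
        int hash(int key);
    }

    /** Uses the key itself as its hash. */
    static final IntHash IDENTITY = key -> key;

    /** The supplemental hash quoted from java.util.HashMap above. */
    static final IntHash SUPPLEMENTAL = key -> {
        int h = key;
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    };

    /** Number of distinct slots used; capacity must be a power of two. */
    static int distinctSlots(IntHash fn, int capacity, int[] keys) {
        Set<Integer> slots = new HashSet<>();
        for (int key : keys) {
            slots.add(fn.hash(key) & (capacity - 1));
        }
        return slots.size();
    }

    static int[] multiplesOf(int step, int count) {
        int[] keys = new int[count];
        for (int i = 0; i < count; i++) {
            keys[i] = i * step;
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = multiplesOf(16, 16); // 0, 16, 32, ..., 240
        System.out.println(distinctSlots(IDENTITY, 16, keys));     // prints 1
        System.out.println(distinctSlots(SUPPLEMENTAL, 16, keys)); // prints 16
    }
}
```

With keys that are multiples of the table size, the identity function sends everything to slot 0, while the mixing function spreads them over all 16 slots. A caller who already knows the key distribution is well mixed can skip the extra work, which is the motivation for making the function pluggable.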

Example 7

Frequency count of character bigrams in a given text.

• HPPC:

final char [] CHARS = DATA;
final IntIntOpenHashMap counts = new IntIntOpenHashMap();
for (int i = 0; i < CHARS.length - 1; i++) {
  counts.putOrAdd((CHARS[i] << 16 | CHARS[i + 1]), 1, 1);
}

• JCF, boxed integer types.

final Integer currentCount = map.get(bigram);map.put(bigram, currentCount == null ? 1 : currentCount + 1);

• JCF, with IntHolder (mutable value object).

• GNU Trove

map.adjustOrPutValue(bigram, 1, 1);

• fastutil, OpenHashMap and LinkedOpenHashMap

map.put(bigram, map.get(bigram) + 1);

• PCJ, OpenHashMap and ChainedHashMap
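For reference, the boxed JCF variant can be written out as a complete, runnable sketch. The class name `BigramSketch` and the sample input are assumptions for illustration; the key packing follows the HPPC snippet above, and `Map.merge` (Java 8) stands in for the get-then-put pair.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the boxed JCF approach.
final class BigramSketch {
    /** Packs two 16-bit chars into one int key, as in the HPPC snippet. */
    static int pack(char first, char second) {
        return (first << 16) | second;
    }

    static Map<Integer, Integer> countBigrams(char[] chars) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < chars.length - 1; i++) {
            // Key and count are boxed on every update.
            counts.merge(pack(chars[i], chars[i + 1]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countBigrams("abab".toCharArray());
        System.out.println(counts.get(pack('a', 'b'))); // prints 2
        System.out.println(counts.get(pack('b', 'a'))); // prints 1
    }
}
```

Each update boxes the key and the new count, which is the overhead the primitive collections listed above are designed to avoid.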

Is Java faster than C/C++?
The short answer is: it depends.

— Cliff Click

Example 8

The same algorithm for building a DFSA automaton accepting a set of strings. Input: 3 565 575 strings, 158M of text.

      gcc -O2   java 1.6.0_20-64
real  63.850s   43.197s
user  63.110s   46.370s
sys    0.240s    0.840s


Summary and Conclusions

Performance checklist (sanity check)

• Algorithms, algorithms, algorithms.

• Proper data structures.

• Spurious GC activity.

• Memory barriers in tight loops.

• CPU cache utilization.

• Low-level, hotspot-specific code structuring.

HPPC and junit-benchmarks are at: http://labs.carrotsearch.com