
Java in High-Performance Computing

Dawid Weiss

Carrot Search
Institute of Computing Science, Poznan University of Technology

GeeCon Poznan, 05/2010

Learn from the mistakes of others. You can’t live long enough to make them all yourself.

— Eleanor Roosevelt

Talk outline

• What is “High performance”?

• What is “Java”?

• Measuring performance (benchmarking).

• HPPC library.

Crosscutting: (un?)common pitfalls and performance killers. Some HotSpot internals.


Divide-and-conquer style algorithm

for (Example e : examples) {
  if (e.hasQuiz()) e.showQuiz(); else e.showCode();
  e.explain();
  e.deriveConclusions();
}

— PART I —

High Performance Computing

High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.

— Wikipedia

Is Java faster than C/C++?
The short answer is: it depends.

— Cliff Click

It’s usually hard to make a fast program run faster.

It’s easy to make a slow program run even slower.

It’s easy to make fast hardware run slow.

For now, HPC

• limited allowed computation time,

• constrained resources (hardware, memory).

Good HPC software ∝ no (obvious) flaws.


— PART II —

What is Java?

(Recall: Is Java faster than C/C++?)

Example 1

public void testSum1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
  result = sum;
}

where the body of sum1 and sum2 sums arguments and returns the result and COUNT is significantly large. . .


VM                sum1   sum2
sun-1.6.0-20      0.04   2.62
sun-1.6.0-16      0.04   3.20
sun-1.5.0-18      0.04   3.29
ibm-1.6.2         0.08   6.28
jrockit-27.5.0    0.18   0.16
harmony-r917296   0.17   0.35

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200).


VM                sum1   sum2   sum3   sum4
sun-1.6.0-20      0.04   2.62   1.05    3.76
sun-1.6.0-16      0.04   3.20   1.39    4.99
sun-1.5.0-18      0.04   3.29   1.46    5.20
ibm-1.6.2         0.08   6.28   0.16   14.64
jrockit-27.5.0    0.18   0.16   1.16    3.18
harmony-r917296   0.17   0.35   9.18   22.49

int sum1(int a, int b) {
  return a + b;
}

Integer sum2(Integer a, Integer b) {
  return a + b;
}

// sum2 with autoboxing made explicit:
Integer sum2(Integer a, Integer b) {
  return Integer.valueOf(
      a.intValue() + b.intValue());
}

int sum3(int... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++)
    sum += args[i];
  return sum;
}

Integer sum4(Integer... args) {
  int sum = 0;
  for (int i = 0; i < args.length; i++) {
    sum += args[i];
  }
  return sum;
}

// sum4 with varargs made explicit:
Integer sum4(Integer [] args) {
  // ...
}

Conclusions

• Syntactic sugar may be costly.

• Primitive types are fast.

• Large differences between different VMs.
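To try the four variants side by side, a minimal self-contained sketch could look like the one below. The method bodies follow the slides; the class name SumVariants and the total helper are hypothetical additions for illustration. All variants compute the same value; they differ only in the boxing and array allocation the compiler inserts behind the syntax.

```java
final class SumVariants {
    static int sum1(int a, int b) {
        return a + b;
    }

    // Autoboxing: arguments are unboxed and the result is boxed again.
    static Integer sum2(Integer a, Integer b) {
        return a + b;
    }

    // Varargs: each call site allocates an int[] for the arguments.
    static int sum3(int... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) sum += args[i];
        return sum;
    }

    // Varargs plus boxing: an Integer[] and boxed elements per call.
    static Integer sum4(Integer... args) {
        int sum = 0;
        for (int i = 0; i < args.length; i++) sum += args[i];
        return sum;
    }

    // Hypothetical driver: runs the chosen variant count times.
    static int total(int variant, int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            switch (variant) {
                case 1: sum += sum1(i, i); break;
                case 2: sum += sum2(i, i); break;
                case 3: sum += sum3(i, i); break;
                default: sum += sum4(i, i); break;
            }
        }
        return sum;
    }
}
```

Timing each call of total separately, after a warm-up, is one way to observe differences of the kind shown in the table on a given VM.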

Example 2

Write once, run anywhere!

But it’s the same VM!

It works on my machine!

private static boolean ready;

public static void startThread() {
  new Thread() {
    public void run() {
      try {
        sleep(2000);
      } catch (Exception e) { /* ignore */ }
      System.out.println("Marking loop exit.");
      ready = true;
    }
  }.start();
}

public static void main(String[] args) {
  startThread();
  System.out.println("Entering the loop...");
  while (!ready) {
    // Do nothing.
  }
  System.out.println("Done, I left the loop!");
}

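while (!ready) {
  // Do nothing.
}

boolean r = ready;
while (!r) {
  // Do nothing.
}

Are the two loops equivalent? In most cases true, from a JMM perspective: ready is not volatile, so the compiler may read it once and never again.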
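The slides do not show a fix. One common remedy, sketched here as an assumption rather than as the talk's recommendation, is to declare the flag volatile, which forbids hoisting the read out of the loop. The class name VolatileFlag and the waitForFlag helper are hypothetical.

```java
final class VolatileFlag {
    // volatile: every loop iteration must observe the latest write.
    private static volatile boolean ready;

    // Starts a thread that sets the flag after delayMillis, then spins
    // until the flag becomes visible. Returns the final flag value.
    static boolean waitForFlag(final long delayMillis) {
        ready = false;
        Thread setter = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                ready = true;
            }
        });
        setter.start();
        while (!ready) {
            // Busy-wait; terminates because the read cannot be hoisted.
        }
        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ready;
    }
}
```

Busy-waiting is kept only to mirror the example; real code would more likely use a CountDownLatch or similar.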


JVM Internals. . .

C1 (client compiler):

• fast

• not (much) optimization

C2 (server compiler):

• slow(er) than C1

• a lot of JMM-allowed optimizations

There are hundreds of JVM tuning/diagnostic switches.

My personal favorite:

Conclusions

• Bytecode is far from what is executed.

• A lot going on under the (VM) hood.

• Bad code may work, but will eventually crash.

• HotSpot-level optimizations are good.

• If there is a bug in the HotSpot compiler. . .


Any other diversifying factors?

• more VM vendors,

• hardware diversity,

• software and hardware quirks.

Non-JVM target platforms

• Dalvik

• GWT

• IKVM

Conclusions

• There is no “single” Java performance model.

• Performance depends on the VM, environment, class library, hardware.

• Apply benchmark-and-correct cycle.

Benchmarking

Example 3

public void testSum1_2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += sum1(i, i);
}

VM                sum1   sum1_2
sun-1.6.0-20      0.04   0.00
sun-1.6.0-16      0.04   0.00
sun-1.5.0-18      0.04   0.00
ibm-1.6.2         0.08   0.01
jrockit-27.5.0    0.17   0.08
harmony-r917296   0.17   0.11


java -server -XX:+PrintOptoAssembly -XX:+PrintCompilation ...

- method holder: 'com/dawidweiss/geecon2010/Example03'
- access: 0xc1000001 public
- name: 'testSum1_2'
...
010   pushq   rbp
      subq    rsp, #16    # Create frame
      nop                 # nop for patch_verified_entry
016   addq    rsp, 16     # Destroy frame
      popq    rbp
      testl   rax, [rip + #offset_to_poll_page]   # Safepoint: poll for GC
021   ret


Conclusions

• Benchmarks must be executed to provide feedback.

• HotSpot is smart and effective at removing dead code.
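The other examples in this talk store the loop result in a field (result, guard) so that the computation stays observable. A minimal sketch of that pattern follows; the class name DeadCodeGuard and the method names are hypothetical, and a field write is usually, though not provably always, enough to keep the loop alive.

```java
final class DeadCodeGuard {
    // Sink field: writing the result here makes the loop's work observable.
    static int guard;

    static int sum1(int a, int b) {
        return a + b;
    }

    // The result is never used, so a JIT compiler is free to drop the loop.
    static void sumDiscarded(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += sum1(i, i);
    }

    // The result escapes through the field, so it has to be computed.
    static void sumGuarded(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += sum1(i, i);
        guard = sum;
    }
}
```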

Example 4

@Test
public void testAdd1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++) {
    sum += add1(i);
  }
  guard = sum;
}

public int add1(int i) {
  return i + 1;
}

Note add1 is virtual.

switch                               testAdd1
-XX:+Inlining -XX:+PrintInlining     0.04
-XX:-Inlining                        0.45

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Ubuntu, dual-core AMD Athlon 5200, JRE 1.7b80-debug).

Most Java calls are monomorphic.

HotSpot adjusts to megamorphic calls automatically.

Example 5

abstract class Superclass {
  abstract int call();
}

class Sub1 extends Superclass
{ int call() { return 1; } }

Superclass[] mixed =
  initWithRandomInstances(10000);

Superclass[] solid =
  initWithSub1Instances(10000);

@Test
public void testMonomorphic() {
  int sum = 0;
  int m = solid.length;
  for (int i = 0; i < COUNT; i++)
    sum += solid[i % m].call();
  guard = sum;
}

@Test
public void testMegamorphic() {
  int sum = 0;
  int m = mixed.length;
  for (int i = 0; i < COUNT; i++)
    sum += mixed[i % m].call();
  guard = sum;
}

VM                monomorphic   megamorphic
sun-1.6.0-20      0.19          0.32
sun-1.6.0-16      0.19          0.34
sun-1.5.0-18      0.18          0.34
ibm-1.6.2         0.20          0.30
jrockit-27.5.0    0.22          0.29
harmony-r917296   0.27          0.32
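The slides do not show the two initializers or the other subclasses. A self-contained sketch of how the setup might look is given below; Sub2, Sub3, the seed parameter and the bodies of both initializers are assumptions made for illustration, not the talk's code.

```java
import java.util.Random;

abstract class Superclass {
    abstract int call();
}

class Sub1 extends Superclass { int call() { return 1; } }
class Sub2 extends Superclass { int call() { return 2; } } // hypothetical
class Sub3 extends Superclass { int call() { return 3; } } // hypothetical

final class CallSites {
    // Every element has the same concrete type: a monomorphic call site.
    static Superclass[] initWithSub1Instances(int n) {
        Superclass[] a = new Superclass[n];
        for (int i = 0; i < n; i++) a[i] = new Sub1();
        return a;
    }

    // Three concrete types in random order: a megamorphic call site.
    static Superclass[] initWithRandomInstances(int n, long seed) {
        Random rnd = new Random(seed);
        Superclass[] a = new Superclass[n];
        for (int i = 0; i < n; i++) {
            switch (rnd.nextInt(3)) {
                case 0: a[i] = new Sub1(); break;
                case 1: a[i] = new Sub2(); break;
                default: a[i] = new Sub3(); break;
            }
        }
        return a;
    }

    // The virtual call under test, shared by both scenarios.
    static int sumCalls(Superclass[] instances, int count) {
        int sum = 0;
        int m = instances.length;
        for (int i = 0; i < count; i++)
            sum += instances[i % m].call();
        return sum;
    }
}
```

Because sumCalls is shared here, its single call site sees whichever types were passed first; benchmarking the two cases in separate methods, as the slides do, keeps their type profiles apart.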

Example 6

@Test
public void testBitCount1() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += Integer.bitCount(i);
  guard = sum;
}

@Test
public void testBitCount2() {
  int sum = 0;
  for (int i = 0; i < COUNT; i++)
    sum += bitCount(i);
  guard = sum;
}

/*
 * Copied from
 * {@link Integer#bitCount}
 */
static int bitCount(int i) {
  // HD, Figure 5-2
  i = i - ((i >>> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);
  i = (i + (i >>> 4)) & 0x0f0f0f0f;
  i = i + (i >>> 8);
  i = i + (i >>> 16);
  return i & 0x3f;
}

VM              testBitCount1   testBitCount2
sun-1.6.0-20    0.43            0.43
sun-1.7.0-b80   0.43            0.43

sun-1.6.0-20    0.08            0.33
sun-1.7.0-b83   0.07            0.32

(averages in sec., 10 measured rounds, 5 warmup, 64-bit Windows 7, Intel I7 860).


... -XX:+PrintInlining ...

...
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Inlining intrinsic _bitCount_i at bci:9 in ..Example06::testBitCount1
Example06.testBitCount1: [measured 10 out of 15 rounds]
round: 0.07 [+- 0.00], round.gc: 0.00 [+- 0.00] ...

@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)
@ 9 com.dawidweiss.geecon2010.Example06::bitCount inline (hot)

Example06.testBitCount2: [measured 10 out of 15 rounds]
round: 0.32 [+- 0.01], round.gc: 0.00 [+- 0.00] ...


... -XX:+PrintOptoAssembly ...

{method}
- klass: {other class}
- method holder: com/dawidweiss/geecon2010/Example06
- name: testBitCount1
...
0c2   B13: #  B12 B14 <- B8 B12   Loop: B13-B12 inner stride: ...
0c2   movl    R10, RDX            # spill
...
0e1   movl    [rsp + #40], R11    # spill
0e6   popcnt  R8, R8
...
0f5   addl    R9, #7              # int
0f9   popcnt  R11, R11
0fe   popcnt  RCX, R9


Conclusions

• Benchmarks must be statistically sound.
  → averages, variance, min, max, warm-up phase

• Account for HotSpot optimisations.

• Account for hardware differences.
  → test-on-target

• Use domain data and real scenarios.

• Inspect suspicious output with debug JVM.

See more: Cliff Click, http://java.sun.com/javaone/2009/articles/rockstar_click.jsp.
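The "statistically sound" point (warm-up phase, then averages, variance, min and max over several measured rounds) can be sketched with a few lines of plain Java. This is a hypothetical minimal harness, not the junit-benchmarks API; the names MiniBench, measure, stats and sink are made up for illustration, and the standard deviation is the population form.

```java
final class MiniBench {
    // Sink for benchmark results so the measured work is not dead code.
    static volatile long sink;

    // Runs warm-up rounds (not timed), then times each measured round.
    // Returns {mean, stddev, min, max} in seconds.
    static double[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) task.run();
        double[] seconds = new double[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return stats(seconds);
    }

    // Computes {mean, population standard deviation, min, max}.
    static double[] stats(double[] samples) {
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : samples) {
            sum += s;
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double mean = sum / samples.length;
        double squares = 0;
        for (double s : samples) squares += (s - mean) * (s - mean);
        double stddev = Math.sqrt(squares / samples.length);
        return new double[] { mean, stddev, min, max };
    }
}
```

A call such as MiniBench.measure(task, 5, 10) mirrors the "10 measured rounds, 5 warmup" setup quoted under the tables; the task should write its result to sink.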

HPPC
High Performance Primitive Collections

Motivation

• Primitive types: fast and memory-friendly.

• Optional assertions.

• Single-threaded. No fail-fast.

• Fast, fast, fast iterators, with no GC overhead.

• Open internals (explicit implementation).

• Programmers know what they’re doing.

Why not JCF?

public interface List<E> extends Collection<E> {
  boolean contains(Object o);   // [-] contract-enforced methods
  Iterator<E> iterator();       // [-] iterators over primitive types?
  Object[] toArray();           // [-] troublesome covariants
  ...
}

Friendly Competition

• fastutil

• PCJ

• GNU Trove

• Apache Mahout (ported COLT)

• Apache Primitive Collections

All of these have pros and cons and deal with JCF compatibility somehow.

Iterators in fastutil or PCJ

interface IntIterator extends Iterator<Integer> {
  // Primitive-specific method
  int nextInt();
}

Iterators in HPPC

public final class IntCursor {
  public int index;
  public int value;
}

public class IntArrayList implements Iterable<IntCursor> {
  public Iterator<IntCursor> iterator() { ... }
}
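One way such a cursor-based iterator can avoid per-element garbage is to allocate a single cursor per iteration and overwrite its fields on every step. The sketch below illustrates that idea only; TinyIntList is a hypothetical stand-in and this is not HPPC's actual implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class IntCursor {
    public int index;
    public int value;
}

// Hypothetical stand-in for a primitive list with open internals.
final class TinyIntList implements Iterable<IntCursor> {
    public int[] buffer = new int[8];
    private int size;

    public void add(int v) {
        if (size == buffer.length) buffer = Arrays.copyOf(buffer, size * 2);
        buffer[size++] = v;
    }

    public int size() {
        return size;
    }

    public Iterator<IntCursor> iterator() {
        return new Iterator<IntCursor>() {
            // One cursor per iteration, reused for every element: no boxing.
            private final IntCursor cursor = new IntCursor();
            private int next = 0;

            public boolean hasNext() {
                return next < size;
            }

            public IntCursor next() {
                if (next >= size) throw new NoSuchElementException();
                cursor.index = next;
                cursor.value = buffer[next++];
                return cursor;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}
```

The trade-off is that the cursor is only valid until the next call to next(), so callers must copy index and value if they need them later.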

Iterating over list elements in HPPC

for (IntCursor c : list) {
  System.out.println(c.index + ": " + c.value);
}

list.forEach(new IntProcedure() {
  public void apply(int value) {
    System.out.println(value);
  }
});

final int [] buffer = list.buffer;
final int size = list.size();
for (int i = 0; i < size; i++) {
  System.out.println(i + ": " + buffer[i]);
}

The fastest one?

What’s in HPPC?

Open implementation is good.

/**
 * Applies a supplemental hash function to a given
 * hashCode, which defends against poor quality
 * hash functions. [...]
 */
static int hash(int h) {
  // This function ensures that hashCodes that differ only by
  // constant multiples at each bit position have a bounded
  // number of collisions (approximately 8 at default load factor).
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}

HashMap rehashes your (carefully crafted) hash code.

HPPC approach (example):

public class LongIntOpenHashMap implements LongIntMap {
  // ...
  public LongIntOpenHashMap(int initialCapacity, float loadFactor,
      LongHashFunction keyHashFunction, IntHashFunction valueHashFunction) {
    // ...
  }
}

Defaults: LongMurmurHash, IntHashFunction.

Example 7

Frequency count of character bigrams in a given text.

• HPPC:

final char [] CHARS = DATA;
final IntIntOpenHashMap counts = new IntIntOpenHashMap();
for (int i = 0; i < CHARS.length - 1; i++) {
  counts.putOrAdd((CHARS[i] << 16 | CHARS[i + 1]), 1, 1);
}

• JCF, boxed integer types.

final Integer currentCount = map.get(bigram);
map.put(bigram, currentCount == null ? 1 : currentCount + 1);

• JCF, with IntHolder (mutable value object).

• GNU Trove

map.adjustOrPutValue(bigram, 1, 1);

• fastutil, OpenHashMap and LinkedOpenHashMap

map.put(bigram, map.get(bigram) + 1);

• PCJ, OpenHashMap and ChainedHashMap
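The slides give no code for the "JCF, with IntHolder" variant. A plausible sketch, under the assumption that IntHolder is a plain mutable wrapper, is shown below: the key is still boxed on lookup, but incrementing a count mutates the holder in place instead of boxing a new Integer and calling put again. The names IntHolder, BigramCounter, count and key are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mutable value object; one instance per distinct bigram.
final class IntHolder {
    int value;
}

final class BigramCounter {
    // Packs two chars into one int key, as in the HPPC snippet.
    static int key(char a, char b) {
        return (a << 16) | b;
    }

    static Map<Integer, IntHolder> count(char[] chars) {
        Map<Integer, IntHolder> counts = new HashMap<Integer, IntHolder>();
        for (int i = 0; i < chars.length - 1; i++) {
            Integer bigram = key(chars[i], chars[i + 1]);
            IntHolder holder = counts.get(bigram);
            if (holder == null) {
                holder = new IntHolder();
                counts.put(bigram, holder);
            }
            holder.value++; // in-place update, no second map operation
        }
        return counts;
    }
}
```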

Is Java faster than C/C++?
The short answer is: it depends.

— Cliff Click

Example 8

The same algorithm for building a DFSA automaton accepting a set of strings. Input: 3 565 575 strings, 158M of text.


        gcc -O2    java 1.6.0_20-64
real    63.850s    43.197s
user    63.110s    46.370s
sys      0.240s     0.840s

Summary and Conclusions

Performance checklist (sanity check)

• Algorithms, algorithms, algorithms.

• Proper data structures.

• Spurious GC activity.

• Memory barriers in tight loops.

• CPU cache utilization.

• Low-level, hotspot-specific code structuring.

HPPC and junit-benchmarks are at: http://labs.carrotsearch.com
