Highly Scalable Java Programming for Multi-Core System

Post on 28-Nov-2014


description

A list of Java programming techniques that can be used to improve the scalability of Java applications.


Zhi Gan (ganzhi@gmail.com)

http://ganzhi.blogspot.com

Highly Scalable Java Programming for Multi-Core System

Agenda

• Software Challenges

• Profiling Tools Introduction

• Best Practice for Java Programming

• Rocket Science: Lock-Free Programming


Software Challenges

• Parallelism
– More threads per system = more parallelism needed to achieve high utilization
– Thread-to-thread affinity (shared code and/or data)

• Memory management
– Sharing of cache and memory bandwidth across more threads = greater need for memory efficiency
– Thread-to-memory affinity (execute thread closest to associated data)

• Storage management
– Allocate data across DRAM, Disk & Flash according to access frequency and patterns


Typical Scalability Curve

The 1st Step: Profiling Parallel Application

Important Profiling Tools

• Java Lock Monitor (JLM)
– understand the usage of locks in applications
– similar tool: Java Lock Analyzer (JLA)

• Multi-core SDK (MSDK)
– in-depth analysis of the complete execution stack

• AIX Performance Tools
– Simple Performance Lock Analysis Tool (SPLAT)
– XProfiler
– prof, tprof and gprof

Tprof and VPA tool

Java Lock Monitor

• %MISS : 100 * SLOW / NONREC
• GETS : Lock Entries
• NONREC : Non Recursive Gets
• SLOW : Non Recursives that Wait
• REC : Recursive Gets
• TIER2 : SMP: Total try-enter spin loop cnt (middle for 3 tier)
• TIER3 : SMP: Total yield spin loop cnt (outer for 3 tier)
• %UTIL : 100 * Hold-Time / Total-Time
• AVER-HTM : Hold-Time / NONREC
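The derived JLM columns (%MISS, %UTIL, AVER-HTM) are plain ratios of the raw counters. A minimal sketch of that arithmetic follows. The class name, method names, and all counter values are made up for illustration and are not part of JLM; the time unit is left unspecified.

```java
// Hypothetical helper; names and values are illustrative, not part of JLM.
final class JlmMetrics {
    // %MISS = 100 * SLOW / NONREC
    static double missPercent(long slow, long nonRec) {
        return nonRec == 0 ? 0.0 : 100.0 * slow / nonRec;
    }

    // %UTIL = 100 * Hold-Time / Total-Time
    static double utilPercent(long holdTime, long totalTime) {
        return totalTime == 0 ? 0.0 : 100.0 * holdTime / totalTime;
    }

    // AVER-HTM = Hold-Time / NONREC
    static double averageHoldTime(long holdTime, long nonRec) {
        return nonRec == 0 ? 0.0 : (double) holdTime / nonRec;
    }

    public static void main(String[] args) {
        // Made-up counters for a single monitor.
        long nonRec = 800;          // NONREC
        long slow = 40;             // SLOW
        long holdTime = 50_000;     // Hold-Time
        long totalTime = 1_000_000; // Total-Time

        System.out.println("%MISS    = " + missPercent(slow, nonRec));         // 5.0
        System.out.println("%UTIL    = " + utilPercent(holdTime, totalTime));  // 5.0
        System.out.println("AVER-HTM = " + averageHoldTime(holdTime, nonRec)); // 62.5
    }
}
```

A high %MISS together with a high AVER-HTM suggests that threads often find the lock held and that it is held for a long time, which is what the techniques in the following slides try to reduce.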

Multi-core SDK

Dead Lock View

Synchronization View

Best Practices for Highly Scalable Java Programming

What Is Lock Contention?

From JLM tool website

Lock Operation Itself Is Expensive

• CAS operations are predominantly used for locking

• These operations take up a large part of the execution time

Reduce Locking Scope

    public synchronized void foo1(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            maph.put(key, value);
        }
    }

Execution Time: 16106 milliseconds

    public void foo2(int k) {
        String key = Integer.toString(k);
        String value = key + "value";
        if (null == key) {
            return;
        } else {
            // Only the map update needs the lock.
            synchronized (this) {
                maph.put(key, value);
            }
        }
    }

Execution Time: 12157 milliseconds

Improvement: 25%

Results from JLM report

Reduced AVER_HTM

Lock Splitting

    public synchronized void addUser1(String u) {
        users.add(u);
    }

    public synchronized void addQuery1(String q) {
        queries.add(q);
    }

Execution Time: 12981 milliseconds

    public void addUser2(String u) {
        synchronized (users) {
            users.add(u);
        }
    }

    public void addQuery2(String q) {
        synchronized (queries) {
            queries.add(q);
        }
    }

Execution Time: 4797 milliseconds

Improvement: 64%

Result from JLM report

Reduced lock tries

Lock Striping

    public synchronized void put1(int indx, String k) {
        share[indx] = k;
    }

Execution Time: 5536 milliseconds

    public void put2(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

Execution Time: 1857 milliseconds

Improvement: 66%

Result from JLM report

More locks with less AVER_HTM
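The put2 snippet does not show how `locks` and `N_LOCKS` are set up. A minimal, self-contained sketch of lock striping might look like the following. The class name `StripedArray`, the choice of 16 stripes, the `get` method, and the assumption that indexes are non-negative are all added here for illustration.

```java
// Sketch of lock striping; StripedArray and N_LOCKS = 16 are assumptions.
final class StripedArray {
    private static final int N_LOCKS = 16;
    private final Object[] locks = new Object[N_LOCKS];
    private final String[] share;

    StripedArray(int size) {
        share = new String[size];
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object(); // one lock object per stripe
        }
    }

    // Slots that map to different stripes can be written concurrently.
    void put(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

    // Readers take the same stripe lock so they see the latest write.
    String get(int indx) {
        synchronized (locks[indx % N_LOCKS]) {
            return share[indx];
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StripedArray array = new StripedArray(1024);
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            // Each worker writes every fourth slot, starting at its own id.
            workers[t] = new Thread(() -> {
                for (int i = id; i < 1024; i += 4) {
                    array.put(i, "v" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(array.get(42)); // prints "v42"
    }
}
```

The number of stripes is a tuning knob: more stripes lower the chance that two threads collide on the same lock, at the cost of more lock objects.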

Split Hot Points : Scalable Counter

– ConcurrentHashMap maintains an independent counter for each segment of the hash map, and uses a lock for each counter

– the global count is obtained by summing all independent counters
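The same idea can be applied to a standalone counter. The sketch below is an illustration under assumptions, not the ConcurrentHashMap source: the class name `StripedCounter`, the stripe selection by thread id, and the stripe count are all hypothetical choices.

```java
// Sketch of a striped counter; not taken from ConcurrentHashMap.
final class StripedCounter {
    private final long[] counts;
    private final Object[] locks;

    StripedCounter(int stripes) {
        counts = new long[stripes];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    // Each thread updates the stripe chosen by its thread id, so threads
    // that land on different stripes do not contend with each other.
    void increment() {
        int i = (int) (Thread.currentThread().getId() % counts.length);
        synchronized (locks[i]) {
            counts[i]++;
        }
    }

    // The global value is the sum of all stripes. While writers are still
    // running, this is not an atomic snapshot of the whole counter.
    long sum() {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        StripedCounter counter = new StripedCounter(8);
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(counter.sum()); // prints 400000
    }
}
```

On Java 8 and later, `java.util.concurrent.atomic.LongAdder` offers a ready-made counter built on a similar split-and-sum design.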

Alternatives to Exclusive Locks

• Duplicate shared resources if possible

• Atomic variables
– counter, sequential number generator, head pointer of a linked list

• Concurrent containers
– java.util.concurrent package, Amino lib

• Read-write lock
– java.util.concurrent.locks.ReadWriteLock
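As an illustration of the read-write lock alternative, the sketch below guards a plain HashMap with a `ReentrantReadWriteLock`, so that many readers can proceed in parallel while writers get exclusive access. The class name `ReadMostlyRegistry` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a read-mostly map; ReadMostlyRegistry is a made-up name.
final class ReadMostlyRegistry {
    private final Map<String, String> data = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Multiple threads may hold the read lock at the same time.
    String get(String key) {
        lock.readLock().lock();
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: it blocks readers and other writers.
    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        ReadMostlyRegistry registry = new ReadMostlyRegistry();
        registry.put("lang", "java");
        System.out.println(registry.get("lang")); // prints "java"
    }
}
```

This tends to pay off only when reads clearly outnumber writes; with frequent writes, the extra bookkeeping of a read-write lock can cost more than a plain exclusive lock.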

Example of AtomicLongArray

    public synchronized void set1(int idx, long val) {
        d[idx] = val;
    }

    public synchronized long get1(int idx) {
        long ret = d[idx];
        return ret;
    }

Execution Time: 23550 milliseconds

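    import java.util.concurrent.atomic.AtomicLongArray;

    private final AtomicLongArray a;

    public void set2(int idx, long val) {
        // set() overwrites the slot, matching set1; addAndGet() would add to it.
        a.set(idx, val);
    }

    public long get2(int idx) {
        long ret = a.get(idx);
        return ret;
    }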

Execution Time: 842 milliseconds

Improvement: 96%

Using Concurrent Containers

• java.util.concurrent package
– since Java 1.5
– ConcurrentHashMap, ConcurrentLinkedQueue, CopyOnWriteArrayList, etc.

• Amino Lib is another good choice
– LockFreeList, LockFreeStack, LockFreeQueue, etc.

• Thread-safe containers

• Optimized for common operations

• High performance and scalability on multi-core platforms

• Drawback: no full feature support
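A short sketch of a concurrent container in use: several threads update a shared ConcurrentHashMap without any explicit lock. The class name `WordCounts` is hypothetical, and the example assumes Java 8 or later because it relies on `merge`, which was added after the Java 1.5 release mentioned above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of a shared frequency table; WordCounts is a made-up name.
final class WordCounts {
    private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

    // merge() on ConcurrentHashMap performs the read-modify-write atomically
    // for the given key (Java 8+).
    void record(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) throws InterruptedException {
        WordCounts wordCounts = new WordCounts();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    wordCounts.record("java");
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(wordCounts.count("java")); // prints 40000
    }
}
```

Note that a separate `get` followed by `put` would not be atomic even on a concurrent map; the compound operation has to go through a single atomic method such as `merge` or `putIfAbsent`.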

Using Immutable and Thread Local data

• Immutable data
– remains unchanged throughout its life cycle
– always thread-safe

• Thread-local data
– used only by a single thread
– not shared among different threads
– can replace a global waiting queue or object pool
– used in work-stealing schedulers
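A minimal sketch of an immutable class: all fields are final, there are no setters, and a "change" produces a new object, so instances can be shared between threads without locking. The class `Endpoint` and its fields are hypothetical examples.

```java
// Sketch of an immutable value class; Endpoint is a made-up example.
final class Endpoint {
    private final String host;
    private final int port;

    Endpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    String host() {
        return host;
    }

    int port() {
        return port;
    }

    // Returns a new instance instead of modifying this one.
    Endpoint withPort(int newPort) {
        return new Endpoint(host, newPort);
    }

    public static void main(String[] args) {
        Endpoint original = new Endpoint("localhost", 8080);
        Endpoint changed = original.withPort(9090);
        System.out.println(original.port()); // prints 8080
        System.out.println(changed.port());  // prints 9090
    }
}
```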

Reduce Memory Allocation

• JVM: two levels of memory allocation
– first from a thread-local buffer
– then from the global buffer

• The thread-local buffer is exhausted quickly if the allocation frequency is high

• The ThreadLocal class may help if a temporary object is needed in a loop
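One way to apply the ThreadLocal suggestion is to keep a reusable scratch buffer per thread instead of allocating a new one on every loop iteration. The sketch below is an illustration only: `LineFormatter`, the buffer size of 256, and the output format are assumptions, and `ThreadLocal.withInitial` requires Java 8 or later.

```java
// Sketch of a per-thread reusable buffer; LineFormatter is a made-up name.
final class LineFormatter {
    // Each thread lazily gets its own StringBuilder, so no locking is needed.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    static String format(String key, int value) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset the reused builder instead of allocating a new one
        sb.append(key).append('=').append(value);
        // toString() still allocates the result; only the builder is reused.
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(format("item", i));
        }
    }
}
```

Whether this is a net win depends on the workload; for small short-lived objects, allocation from the thread-local buffer is already cheap, so measuring before and after is advisable.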

Rocket Science: Lock-Free Programming

Using Lock-Free/Wait-Free Algorithms

• Lock-free algorithms allow concurrent updates of shared data structures without using any locking mechanism
– solves some of the basic problems associated with using locks in the code
– helps create algorithms that show good scalability

• Highly scalable and efficient

• Amino Lib

Why Lock-Free Often Means Better Scalability? (I)

• Lock: all threads wait for one

• Lock-free: no waiting, but only one thread can succeed; the other threads need to retry

Why Lock-Free Often Means Better Scalability? (II)

• Lock: all threads wait for one

• Lock-free: no waiting, but only one thread can succeed; the other threads often need to retry
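The "only one can succeed, others retry" behavior can be seen in a compare-and-set loop. The sketch below is a minimal lock-free stack in the style commonly known as a Treiber stack, built on `AtomicReference`. It is an illustration under assumptions, not the Amino Lib implementation; the class name `LockFreeStackSketch` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a CAS-based stack; not the Amino Lib LockFreeStack.
final class LockFreeStackSketch<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        while (true) {
            Node<T> current = head.get();
            Node<T> node = new Node<>(value, current);
            // Succeeds only if no other thread changed head in the meantime;
            // otherwise loop and retry with the new head.
            if (head.compareAndSet(current, node)) {
                return;
            }
        }
    }

    // Returns null when the stack is empty.
    T pop() {
        while (true) {
            Node<T> current = head.get();
            if (current == null) {
                return null;
            }
            if (head.compareAndSet(current, current.next)) {
                return current.value;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockFreeStackSketch<Integer> stack = new LockFreeStackSketch<>();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    stack.push(i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        int popped = 0;
        while (stack.pop() != null) {
            popped++;
        }
        System.out.println(popped); // prints 4000
    }
}
```

No thread ever blocks while another holds a lock: a thread that loses the race simply rereads the head and tries again, so at least one thread always makes progress.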


Performance of A Lock-Free Stack

Picture from: http://www.infoq.com/articles/scalable-java-components

References

• Amino Lib – http://amino-cbbs.sourceforge.net/

• MSDK – http://www.alphaworks.ibm.com/tech/msdk

• JLA – http://www.alphaworks.ibm.com/tech/jla

Backup