Highly Scalable Java Programming for Multi-Core System
Zhi Gan ([email protected])
http://ganzhi.blogspot.com
Agenda
• Software Challenges
• Profiling Tools Introduction
• Best Practice for Java Programming
• Rocket Science: Lock-Free Programming
Software Challenges
• Parallelism
– More threads per system = more parallelism needed to achieve high utilization
– Thread-to-thread affinity (shared code and/or data)
• Memory management
– Sharing of cache and memory bandwidth across more threads = greater need for memory efficiency
– Thread-to-memory affinity (execute a thread closest to its associated data)
• Storage management
– Allocate data across DRAM, disk and flash according to access frequency and patterns
Typical Scalability Curve
The 1st Step: Profiling Parallel Application
Important Profiling Tools
• Java Lock Monitor (JLM)
– understand the usage of locks in an application
– similar tool: Java Lock Analyzer (JLA)
• Multi-core SDK (MSDK)
– in-depth analysis of the complete execution stack
• AIX Performance Tools
– Simple Performance Lock Analysis Tool (SPLAT)
– XProfiler
– prof, tprof and gprof
Tprof and VPA tool
Java Lock Monitor
• %MISS : 100 * SLOW / NONREC
• GETS : Lock Entries
• NONREC : Non-Recursive Gets
• SLOW : Non-Recursives that Wait
• REC : Recursive Gets
• TIER2 : SMP: Total try-enter spin loop count (middle for 3 tier)
• TIER3 : SMP: Total yield spin loop count (outer for 3 tier)
• %UTIL : 100 * Hold-Time / Total-Time
• AVER-HTM : Hold-Time / NONREC
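To make the derived columns concrete, the sketch below computes %MISS, %UTIL and AVER-HTM from raw counts using the formulas listed above. The class name, method names and sample numbers are hypothetical and chosen only for illustration; they do not come from a real JLM report.

```java
// Illustrative only: this class, its methods and the sample values are
// assumptions for the example, not part of the JLM tool.
final class JlmMetrics {
    // %MISS = 100 * SLOW / NONREC
    static double missPercent(long slow, long nonRec) {
        return 100.0 * slow / nonRec;
    }

    // %UTIL = 100 * Hold-Time / Total-Time
    static double utilPercent(long holdTime, long totalTime) {
        return 100.0 * holdTime / totalTime;
    }

    // AVER-HTM = Hold-Time / NONREC
    static double averageHoldTime(long holdTime, long nonRec) {
        return (double) holdTime / nonRec;
    }

    public static void main(String[] args) {
        long nonRec = 2_000, slow = 500;              // made-up lock counts
        long holdTime = 40_000, totalTime = 100_000;  // made-up time units
        System.out.println(missPercent(slow, nonRec));          // prints 25.0
        System.out.println(utilPercent(holdTime, totalTime));   // prints 40.0
        System.out.println(averageHoldTime(holdTime, nonRec));  // prints 20.0
    }
}
```

Read this way, a high %MISS means many acquisitions had to wait, and a high AVER-HTM means the lock is held for a long time on each acquisition.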
Multi-core SDK
Deadlock View
Synchronization View
Best Practices for Highly Scalable Java Programming
What Is Lock Contention?
From JLM tool website
Lock Operation Itself Is Expensive
• CAS operations are predominantly used for locking
• Locking can take up a large part of the execution time
Reduce Locking Scope
public synchronized void foo1(int k) {
    // The whole method, including building the strings, runs under the lock.
    String key = Integer.toString(k);
    String value = key + "value";
    if (null == key) {
        return;
    } else {
        maph.put(key, value);
    }
}
Execution Time: 16106 milliseconds
public void foo2(int k) {
    // Key and value are built outside the lock; only the shared map is protected.
    String key = Integer.toString(k);
    String value = key + "value";
    if (null == key) {
        return;
    } else {
        synchronized (this) {
            maph.put(key, value);
        }
    }
}
Execution Time: 12157 milliseconds
Execution time reduced by about 25%
Results from JLM report
Reduced AVER_HTM
Lock Splitting
public synchronized void addUser1(String u) {
    users.add(u);
}
public synchronized void addQuery1(String q) {
    queries.add(q);
}
Execution Time: 12981 milliseconds
public void addUser2(String u) {
    // Each collection is guarded by its own lock, so the two methods no longer contend.
    synchronized (users) {
        users.add(u);
    }
}
public void addQuery2(String q) {
    synchronized (queries) {
        queries.add(q);
    }
}
Execution Time: 4797 milliseconds
Execution time reduced by about 63%
Result from JLM report
Reduced lock tries
Lock Striping
public synchronized void put1(int indx, String k) {
    share[indx] = k;
}
Execution Time: 5536 milliseconds
public void put2(int indx, String k) {
    // One lock per stripe: writes to different stripes can proceed in parallel.
    synchronized (locks[indx % N_LOCKS]) {
        share[indx] = k;
    }
}
Execution Time: 1857 milliseconds
Execution time reduced by about 66%
Result from JLM report
More locks with less AVER_HTM
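The put2 fragment above omits the supporting declarations. A minimal self-contained sketch of the same idea could look like this; the field names follow the slide, while the class name StripedArray, the get method and the value 16 for N_LOCKS are assumptions made for the example.

```java
// Sketch of lock striping. Field names follow the slide; the class name,
// the get method and N_LOCKS = 16 are assumptions for illustration.
final class StripedArray {
    private static final int N_LOCKS = 16;
    private final Object[] locks = new Object[N_LOCKS];
    private final String[] share;

    StripedArray(int size) {
        share = new String[size];
        // Each stripe needs its own distinct lock object.
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new Object();
        }
    }

    // indx must be a valid, non-negative array index.
    void put(int indx, String k) {
        synchronized (locks[indx % N_LOCKS]) {
            share[indx] = k;
        }
    }

    // Reads take the same stripe lock so they see the latest write.
    String get(int indx) {
        synchronized (locks[indx % N_LOCKS]) {
            return share[indx];
        }
    }
}
```

The number of stripes is a tuning choice: more stripes reduce contention but make any operation that needs the whole array (for example a consistent snapshot) acquire more locks.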
Split Hot Points : Scalable Counter
– ConcurrentHashMap (in its segmented implementation, used before Java 8) maintains an independent counter for each segment of the hash map and uses a lock for each counter
– the global count is obtained by summing all the independent counters
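The idea of splitting one hot counter into several independently locked counters can be sketched as follows. This is not the ConcurrentHashMap source; the class StripedCounter and its methods are hypothetical names used only to show the pattern.

```java
// Sketch of a counter split across several stripes, each with its own lock.
// StripedCounter is a hypothetical class, not a JDK or Amino API.
final class StripedCounter {
    private final long[] counts;
    private final Object[] locks;

    StripedCounter(int stripes) {
        counts = new long[stripes];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    // The key (for example a hash code) selects which stripe is updated.
    void increment(int key) {
        int s = Math.floorMod(key, counts.length);
        synchronized (locks[s]) {
            counts[s]++;
        }
    }

    // The global value is the sum of all stripes. Under concurrent updates
    // this is not an atomic snapshot of all stripes at one instant.
    long sum() {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

Since Java 8, java.util.concurrent.atomic.LongAdder offers a ready-made counter built on a similar spreading idea, without explicit locks.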
Alternatives to Exclusive Locks
• Duplicate the shared resource if possible
• Atomic variables
– counter, sequential number generator, head pointer of a linked list
• Concurrent containers
– java.util.concurrent package, Amino Lib
• Read-write lock
– java.util.concurrent.locks.ReadWriteLock
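As one illustration of the read-write lock alternative, the sketch below guards a plain HashMap with a ReentrantReadWriteLock so that readers can run concurrently while writers get exclusive access. The class name ReadMostlyCache is an assumption for the example; the lock classes are the standard ones from java.util.concurrent.locks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: ReadMostlyCache is a hypothetical class for illustration.
final class ReadMostlyCache {
    private final Map<String, String> map = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Many threads may hold the read lock at the same time.
    String get(String key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: it blocks readers and other writers.
    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

This tends to pay off only when reads clearly outnumber writes; with frequent writes a read-write lock can be slower than a plain exclusive lock, so it is worth measuring.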
Example of AtomicLongArray
public synchronized void set1(int idx, long val) {
    d[idx] = val;
}
public synchronized long get1(int idx) {
    long ret = d[idx];
    return ret;
}
Execution Time: 23550 milliseconds
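import java.util.concurrent.atomic.AtomicLongArray;

private final AtomicLongArray a;
public void set2(int idx, long val) {
    // set() is the atomic equivalent of d[idx] = val in set1;
    // addAndGet() would add val to the element instead of storing it.
    a.set(idx, val);
}
public long get2(int idx) {
    long ret = a.get(idx);
    return ret;
}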
Execution Time: 842 milliseconds
Execution time reduced by about 96%
Using Concurrent Containers
• java.util.concurrent package
– since Java 1.5
– ConcurrentHashMap, ConcurrentLinkedQueue, CopyOnWriteArrayList, etc.
• Amino Lib is another good choice
– LockFreeList, LockFreeStack, LockFreeQueue, etc.
• Thread-safe containers
• Optimized for common operations
• High performance and scalability on multi-core platforms
• Drawback: some features are not supported
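A short sketch of how the java.util.concurrent containers listed above replace hand-written locking: no synchronized block appears in the class. The class and method names are hypothetical; note that merge and getOrDefault were added in Java 8, later than the containers themselves.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch only: the class and its methods are hypothetical names.
final class ConcurrentContainersDemo {
    private final ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    // merge() performs the read-modify-write atomically for the given key.
    void recordHit(String page) {
        hits.merge(page, 1, Integer::sum);
    }

    int hitsFor(String page) {
        return hits.getOrDefault(page, 0);
    }

    // offer() and poll() are thread-safe without external locking.
    void submit(String task) {
        pending.offer(task);
    }

    // Returns null when the queue is empty.
    String next() {
        return pending.poll();
    }
}
```

Compound actions that span several calls (for example "check the count, then enqueue") are still not atomic as a whole and need their own coordination.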
Using Immutable and Thread Local data
• Immutable data
– remains unchanged throughout its life cycle
– always thread-safe
• Thread-local data
– used by only a single thread
– not shared among different threads
– can replace a global waiting queue or object pool
– used in work-stealing schedulers
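For the immutable case, a minimal sketch of a value class that can be shared between threads without any locking might look like this. ImmutablePoint is a hypothetical example class, not something from the original slides.

```java
// Sketch of an immutable value: all fields are final and there are no
// setters, so instances can be shared freely between threads.
// ImmutablePoint is a hypothetical class for illustration.
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() {
        return x;
    }

    int y() {
        return y;
    }

    // "Modification" returns a new object and leaves this one untouched.
    ImmutablePoint translate(int dx, int dy) {
        return new ImmutablePoint(x + dx, y + dy);
    }
}
```

The trade-off is extra allocation on every change, which has to be weighed against the cost of the locking it removes.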
Reduce Memory Allocation
• JVM: two levels of memory allocation
– first from a thread-local buffer
– then from the global buffer
• The thread-local buffer is exhausted quickly if the allocation frequency is high
• The ThreadLocal class may help if a temporary object is needed in a loop
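One way to apply the ThreadLocal suggestion is to keep a per-thread scratch buffer and reuse it on every call instead of allocating a new one. The sketch below assumes a StringBuilder as the temporary object; the class name ThreadLocalBuffer and the format method are hypothetical, and ThreadLocal.withInitial requires Java 8 or later.

```java
// Sketch only: ThreadLocalBuffer and format() are hypothetical names.
final class ThreadLocalBuffer {
    // Each thread lazily gets its own StringBuilder, so no locking is needed.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    static String format(String key, int value) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset the reused buffer instead of allocating a new one
        sb.append(key).append('=').append(value);
        return sb.toString(); // the result String is still a new allocation
    }
}
```

Whether this actually helps depends on the JVM and the workload, since short-lived allocations are often cheap; it is best confirmed with a profiler before adopting it.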
Rocket Science: Lock-Free Programming
Using Lock-Free/Wait-Free Algorithm
• Lock-free algorithms allow concurrent updates of shared data structures without using any locking mechanism
– solves some of the basic problems associated with using locks in the code
– helps create algorithms that show good scalability
• Highly scalable and efficient
• Amino Lib
Why Does Lock-Free Often Mean Better Scalability? (I)
• Lock: all threads wait for one
• Lock-free: no waiting, but only one thread can succeed; the other threads need to retry
Why Does Lock-Free Often Mean Better Scalability? (II)
• Lock: all threads wait for one
• Lock-free: no waiting, but only one thread can succeed; the other threads often need to retry
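The "only one can succeed, the others retry" behaviour can be seen in a compare-and-set loop. The sketch below is the well-known Treiber stack built on AtomicReference; it is offered as an illustration of the pattern and is not the Amino Lib LockFreeStack implementation. The class name LockFreeStackSketch is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Treiber-style lock-free stack. Not the Amino Lib code;
// LockFreeStackSketch is a hypothetical name for illustration.
final class LockFreeStackSketch<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
            // If another thread changed head in the meantime, the CAS fails
            // and this thread retries with the fresh head. No thread blocks.
        } while (!head.compareAndSet(oldHead, newHead));
    }

    // Returns null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

A failed compareAndSet means some other thread's operation completed, so the system as a whole always makes progress even though an individual thread may loop several times.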
Performance of A Lock-Free Stack
Picture from: http://www.infoq.com/articles/scalable-java-components
References
• Amino Lib – http://amino-cbbs.sourceforge.net/
• MSDK – http://www.alphaworks.ibm.com/tech/msdk
• JLA – http://www.alphaworks.ibm.com/tech/jla
Backup