Optimizing your Java applications for multi-core hardware
Optimizing your Java Applications for multi-core hardware
Prashanth K ([email protected])
Java Technologies, IBM
Agenda
• Evolution of Processor Architecture
• Why should I care?
• Think about Scalability
• How to exploit
• Parallelism in Java
• JVM optimizations for multi-core scalability
As the World Gets Smarter, Demands on IT Will Grow
Smart energy grids
Smart healthcare
Smart food systems
Intelligent oil field technologies
Smart supply chains
Smart retail
• IT infrastructure must grow to meet these demands in global scope, processing scale, and efficiency
• 10x: digital data is projected to grow tenfold from 2007 to 2011
• 1 Trillion: devices will be connected to the internet by 2011
• 25 Billion: global trading systems are under extreme stress, handling billions of market data messages each day
• 70¢ per $1: 70% on average is spent on maintaining current IT infrastructure versus adding new capabilities
Hardware Trends
Increasing transistor density
Clock speed leveling off
More cores per chip
Non-Uniform Memory Access (NUMA)
Main memory getting larger
In 2010, POWER Systems Brings Massive Parallelism
• POWER4™ (2001, 180 nm): 1 thread/core × 2 cores/chip × 16 sockets/server = 32 threads
• POWER5™ (2004, 130 nm): 2 threads/core × 2 cores/chip × 32 sockets/server = 128 threads
• POWER6™ (2007, 65 nm): 2 threads/core × 2 cores/chip × 32 sockets/server = 128 threads
• POWER7™ (2010, 45 nm): 4 threads/core × 8 cores/chip × 32 sockets/server = 1024 threads
Why should I care?
• Your application may be re-used
• Better performance
• Better leverage of additional resources: cores, hardware threads, memory, etc.
Think about scalability
• Serial bottlenecks inhibit scalability
• Organize your application into parallel tasks
– Consider the Executor framework (java.util.concurrent.ExecutorService)
– Too many threads can be just as bad as too few
• Do not rely on the JVM to discover opportunities
– No automatic parallelization
– Java class libraries do not exploit vector processor capabilities
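As a sketch of organizing work into parallel tasks with the standard Executor framework, the class below splits a summation into one task per available core and combines the partial results. `ParallelSum` and its chunking scheme are illustrative assumptions, and the lambda syntax requires Java 8 or later, newer than the JDK generation this talk targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSum {
    // Sums the integers 0..n-1 by splitting the range into one task per available core.
    static long sum(long n) {
        int workers = Runtime.getRuntime().availableProcessors();
        // Fixed-size pool: roughly one thread per core suits CPU-bound work.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long chunk = (n + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final long from = w * chunk;
                final long to = Math.min(n, from + chunk);
                // Each task sums an independent sub-range; tasks share no mutable state.
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (long i = from; i < to; i++) {
                        s += i;
                    }
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Sizing the pool from `availableProcessors()` rather than a constant is one way to avoid both too few and too many threads for CPU-bound work.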
Think about scalability
• Load imbalance
– Workload not evenly distributed
– Consider breaking large tasks into smaller ones
– Change serial algorithms to parallel ones
• Tracing and I/O
– A bottleneck unless updates are infrequent or the log is striped (RAID)
– Blocking disk/console I/O inhibits scalability
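One way to break a large task into smaller ones is the fork/join framework (`ForkJoinPool`, added in Java SE 7, after this talk). The sketch below recursively halves an array sum until pieces fall under a threshold, so idle worker threads can steal the remaining pieces and the load evens out. `SumTask` and the `THRESHOLD` value are hypothetical, and `commonPool()` requires Java 8.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively splits an array sum until each piece is small enough to run serially.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // hypothetical cut-off; tune per workload
    private final long[] data;
    private final int lo;
    private final int hi; // half-open range [lo, hi)

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long s = 0;
            for (int i = lo; i < hi; i++) {
                s += data[i];
            }
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                                        // left half may be stolen by another worker
        long right = new SumTask(data, mid, hi).compute();  // right half runs in this thread
        return right + left.join();
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

Too small a threshold adds task overhead; too large a threshold brings back the imbalance the split was meant to fix.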
Synchronization and locking
• J9's three-tiered locking
– Spin
– Yield
– OS
• Avoid synchronization in static methods (a static synchronized method locks the single Class object, so all callers contend for one lock)
• Consider breaking long synchronized blocks into several smaller ones
– May be bad if it results in many context switches
• The Java Lock Monitor (JLM) tool can help: http://perfinsp.sourceforge.net/jlm.html
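A simple relative of splitting long synchronized blocks is moving work that touches no shared state out of the block entirely. The hypothetical `ResultCollector` below contrasts a coarse and a narrow critical section, and guards its state with a private instance lock rather than a `static synchronized` method. Narrowing only helps when the moved work really is thread-local.

```java
import java.util.ArrayList;
import java.util.List;

class ResultCollector {
    private final Object lock = new Object(); // private per-instance lock, not a Class-wide one
    private final List<String> results = new ArrayList<>();

    // Coarse: the formatting runs while holding the lock, serializing other threads behind it.
    void addCoarse(int value) {
        synchronized (lock) {
            String formatted = expensiveFormat(value);
            results.add(formatted);
        }
    }

    // Narrow: only the update of the shared list is guarded.
    void addNarrow(int value) {
        String formatted = expensiveFormat(value); // touches no shared state, so no lock needed
        synchronized (lock) {
            results.add(formatted);
        }
    }

    int size() {
        synchronized (lock) {
            return results.size();
        }
    }

    // Hypothetical stand-in for real work; it only reads its argument.
    private static String expensiveFormat(int value) {
        return String.format("item-%08d", value);
    }
}
```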
Synchronization and locking
• Volatiles
– The compiler will not cache the value (e.g., in a register)
– Creates a memory barrier
• Avoid synchronized container classes (e.g., Vector, Hashtable, Collections.synchronizedMap)
– Building scalable data structures is difficult
– Use java.util.concurrent (j/u/c)
• Non-blocking object access
– Possible with j/u/c
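As a minimal illustration of volatile visibility, assuming a simple stop flag is all that is needed: the worker below polls a `volatile boolean`, so a write from another thread is guaranteed to become visible and the loop terminates. `StoppableWorker` is a hypothetical name. Note that `volatile` does not make compound actions such as `count++` atomic; that is what the j/u/c atomic classes are for.

```java
class StoppableWorker implements Runnable {
    // volatile: a write by one thread is visible to the others, so the JIT
    // may not keep the value in a register and hoist the check out of the loop.
    private volatile boolean running = true;
    private long iterations; // written only by the worker thread

    @Override
    public void run() {
        while (running) {
            iterations++;
        }
    }

    void stop() {
        running = false;
    }

    // Usage sketch: start a worker, stop it, and report whether it exited in time.
    static boolean stopsWithin(long timeoutMillis) {
        StoppableWorker w = new StoppableWorker();
        Thread t = new Thread(w);
        t.start();
        w.stop();
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }
}
```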
java.util.concurrent package
• Introduced in Java SE 5
• An alternative form of strong synchronization
• Lighter weight, better scalability
– Compared to intrinsic locks
• java.util.concurrent.atomic.*
• java.util.concurrent.locks.*
• Concurrent collections
• Synchronizers
• Executor framework
j/u/c/atomic.*
• Atomic primitives
• Strong form of synchronization
– But does not use locks: non-blocking
– Exploits atomic hardware instructions such as compare-and-swap
• Supports compound actions
• AtomicBoolean
• AtomicInteger
• AtomicIntegerArray
• AtomicIntegerFieldUpdater
• AtomicLong
• AtomicLongArray
• AtomicLongFieldUpdater
• AtomicMarkableReference
• AtomicReference
• AtomicReferenceArray
• AtomicReferenceFieldUpdater
• AtomicStampedReference
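A minimal sketch of a shared, non-blocking counter built on `AtomicLong`, assuming a hit counter is the use case (`HitCounter` is hypothetical). Several threads increment it concurrently without any `synchronized` block, and no update is lost.

```java
import java.util.concurrent.atomic.AtomicLong;

class HitCounter {
    private final AtomicLong hits = new AtomicLong();

    // Lock-free atomic increment; no thread ever blocks waiting for another.
    long record() {
        return hits.incrementAndGet();
    }

    long total() {
        return hits.get();
    }

    // Starts `threads` threads that each record `perThread` hits, then returns the total.
    static long countConcurrently(int threads, int perThread) {
        HitCounter c = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    c.record();
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return c.total();
    }
}
```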
j/u/c/atomic.*
• Getters and setters
– get
– set
– lazySet
• Updates
– getAndSet
– getAndAdd / getAndIncrement / getAndDecrement
– addAndGet / incrementAndGet / decrementAndGet
• CAS
– compareAndSet / weakCompareAndSet
• Conversions
– toString, intValue, longValue, floatValue, doubleValue
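Compound actions that the atomic classes do not provide directly can be built from `compareAndSet` with a retry loop. The hypothetical `MaxTracker` below keeps a running maximum: it reads the current value, returns early if the candidate is not larger, and otherwise retries until its CAS succeeds against an unchanged value.

```java
import java.util.concurrent.atomic.AtomicLong;

class MaxTracker {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Atomic "read, compare, write" via a compareAndSet retry loop.
    void offer(long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return; // nothing to update
            }
        } while (!max.compareAndSet(current, candidate)); // another thread changed it: retry
    }

    long get() {
        return max.get();
    }
}
```

On Java 8 and later, `max.accumulateAndGet(candidate, Math::max)` expresses the same loop in a single call.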
j/u/c/locks.*
• Problems with intrinsic locks
– Impossible to back off from a lock attempt
• Deadlock
– Lack of features
• Read vs. write
• Fairness policies
– Block-structured
• Must be acquired and released within the same block
• j/u/c/locks
– Greater flexibility for locks and conditions
• Non-block-structured
– Provides reader-writer locks
• Why block other readers?
• Better scalability
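The back-off capability missing from intrinsic locks is available through `Lock.tryLock` with a timeout. In this sketch (the `Account` class and the transfer scenario are hypothetical), two threads transferring in opposite directions might each hold one lock; instead of deadlocking, the one that times out releases what it holds and reports failure, so the caller can retry.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class Account {
    private final Lock lock = new ReentrantLock();
    private long balance;

    Account(long balance) {
        this.balance = balance;
    }

    // Moves `amount` to `to`; returns false (holding no locks) if either lock
    // cannot be acquired within the timeout or funds are insufficient.
    boolean transfer(Account to, long amount, long timeoutMs) throws InterruptedException {
        if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            if (!to.lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false; // back off: the finally below releases our own lock
            }
            try {
                if (balance < amount) {
                    return false;
                }
                balance -= amount;
                to.balance += amount;
                return true;
            } finally {
                to.lock.unlock();
            }
        } finally {
            lock.unlock();
        }
    }

    long balance() {
        lock.lock();
        try {
            return balance;
        } finally {
            lock.unlock();
        }
    }
}
```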
j/u/c/locks.*
Interfaces:
– Condition
– Lock
– ReadWriteLock
Classes:
– ReentrantLock
– ReentrantReadWriteLock
– LockSupport
– AbstractQueuedSynchronizer
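A reader-writer lock lets concurrent readers proceed while a writer gets exclusive access. The hypothetical `ReadMostlyCache` below wraps a plain `HashMap` with a `ReentrantReadWriteLock`; it pays off mainly when reads greatly outnumber writes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadMostlyCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReadWriteLock rw = new ReentrantReadWriteLock();

    // Many threads may hold the read lock at the same time.
    V get(K key) {
        rw.readLock().lock();
        try {
            return map.get(key);
        } finally {
            rw.readLock().unlock();
        }
    }

    // The write lock is exclusive: readers and other writers wait while it is held.
    void put(K key, V value) {
        rw.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            rw.writeLock().unlock();
        }
    }
}
```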
j/u/c.* - Concurrent Collections
Concurrent, thread-safe implementations of several collections
HashMap → ConcurrentHashMap
TreeMap → ConcurrentSkipListMap
ArrayList → CopyOnWriteArrayList (best for read-mostly lists)
Set → CopyOnWriteArraySet (best for small, read-mostly sets)
Queues → ConcurrentLinkedQueue or one of the blocking queues
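A sketch of replacing a synchronized `HashMap` with `ConcurrentHashMap`, assuming a word-count workload (`WordCounts` is hypothetical). Several threads update one map with `merge`, which `ConcurrentHashMap` performs atomically per key; `merge` and the method reference need Java 8 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class WordCounts {
    // Several threads update one shared map without any external lock.
    static Map<String, Integer> count(List<String> words, int threads) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int start = t;
            workers[t] = new Thread(() -> {
                // Thread t handles words t, t + threads, t + 2 * threads, ...
                for (int i = start; i < words.size(); i += threads) {
                    // Atomic per-key read-modify-write: insert 1 or add 1 to the existing count.
                    counts.merge(words.get(i), 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counts;
    }
}
```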
Strains on the VM
• Excessive use of temporary memory can lead to increased garbage collector activity
– Stop-the-world GC pauses the application
• Excessive class loading
– Updating the class hierarchy
– Invalidating JIT optimizations
– Consider creating a “startup” phase
• Transitions between Java and native code
– VM access lock
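One way to implement a “startup” phase, assuming the application knows which classes it will need: load and initialize them with `Class.forName` before handing work to parallel threads, so most class loading happens before steady-state processing. `StartupPhase` and the class list are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.List;

class StartupPhase {
    // Hypothetical list of classes the application is known to need later.
    static final List<String> CLASSES_TO_PRELOAD = Arrays.asList(
            "java.util.concurrent.ConcurrentHashMap",
            "java.util.concurrent.locks.ReentrantReadWriteLock",
            "java.util.concurrent.ForkJoinPool");

    // Loads and initializes each named class up front; returns how many succeeded.
    static int preload(List<String> names) {
        int loaded = 0;
        for (String name : names) {
            try {
                Class.forName(name); // loads, links, and initializes the class
                loaded++;
            } catch (ClassNotFoundException e) {
                System.err.println("Could not preload " + name);
            }
        }
        return loaded;
    }
}
```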
Memory Footprint
• Little control over object allocation in Java
– Small, short-lived objects are easier to cache
– Large, long-lived objects are likely to cause cache misses
– The Memory Analyzer Tool (MAT) can help
• Consider using large pages to reduce TLB misses
– -Xlp; requires OS support
• Tune your heap settings
– Heap lock contention with a flat heap
Affinitizing JVMs
• Binding a JVM to a subset of cores lets it exploit the cache hierarchy of those cores
• The JVM working set can fit within the physical memory of a single node in a NUMA system
• Linux: taskset, numactl
• Windows: start /affinity
Is my application scalable?
• Low CPU utilization means resources are not being fully used
– Evaluate whether the application has too few or too many threads
– Locks and synchronization
– Network connections, I/O
– Thrashing: the working set is too large for physical memory
• High CPU utilization is generally good, as long as the cycles are spent in application threads doing meaningful work
• Evaluate where time is being spent
– Garbage collection
– VM/JIT
– OS kernel functions
– Other processes
• Tune, tune, tune
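To see where time is being spent from inside the application, the standard `java.lang.management` API exposes some of these numbers. The sketch below (class name hypothetical) sums GC time across collectors, totals how often live threads have blocked on monitors (a hint of lock contention), and reads the calling thread's CPU time. These are coarse indicators; dedicated tools such as IBM's Health Center give a far more complete picture.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class WhereDoesTimeGo {
    // Milliseconds spent in GC so far, summed over all collectors.
    // getCollectionTime() returns -1 for a collector that does not report it.
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Total number of times live threads blocked to enter or re-enter a monitor;
    // a fast-growing value hints at lock contention.
    static long totalBlockedCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long total = 0;
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null) { // null if the thread exited in the meantime
                total += info.getBlockedCount();
            }
        }
        return total;
    }

    // CPU time of the calling thread in nanoseconds, or -1 if unsupported or disabled.
    static long currentThreadCpuNanos() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.isCurrentThreadCpuTimeSupported() ? threads.getCurrentThreadCpuTime() : -1;
    }

    public static void main(String[] args) {
        System.out.println("GC time (ms): " + gcMillis());
        System.out.println("monitor blocks: " + totalBlockedCount());
        System.out.println("main thread CPU (ns): " + currentThreadCpuNanos());
    }
}
```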
Write Once, Tune Everywhere
• Health Center, Garbage Collection and Memory Visualizer (GCMV), Memory Analyzer (MAT): http://www.ibm.com/developerworks/java/jdk/tools/
• Dependence on the operating system
– Memory allocation
– Socket layer
• Tune for hardware capabilities
– How many cores? How much memory?
– What is the limit on network access?
– Are there storage bottlenecks?
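A sketch of querying hardware capabilities at run time rather than hard-coding them. `availableProcessors()` reports the logical processors the JVM may use, and `maxMemory()` the heap limit. The `poolSize` rule, threads = cores × (1 + wait time / compute time), is a common sizing heuristic and an assumption here, not an IBM recommendation; `HardwareInfo` is hypothetical.

```java
class HardwareInfo {
    // Common sizing heuristic (an assumption, not a JVM rule):
    // threads = cores * (1 + waitTime / computeTime).
    // A ratio of 0 means purely CPU-bound work: one thread per core.
    static int poolSize(double waitToComputeRatio) {
        int cpus = Runtime.getRuntime().availableProcessors();
        return (int) Math.max(1, Math.round(cpus * (1 + waitToComputeRatio)));
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Logical processors this JVM may use right now.
        System.out.println("logical processors: " + rt.availableProcessors());
        // Maximum heap the JVM will attempt to use (Long.MAX_VALUE if there is no limit).
        System.out.println("max heap (MB): " + rt.maxMemory() / (1024 * 1024));
        System.out.println("pool size, CPU-bound: " + poolSize(0.0));
        System.out.println("pool size, waits as long as it computes: " + poolSize(1.0));
    }
}
```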
IBM Java Execution Model is Built for Parallelism
JIT Compiler
• Generates high-performance code for application threads
• Customizes execution to the underlying hardware
• Optimizes locking performance
• Asynchronous compilation thread
Application Threads
• Java software threads are executed on multiple hardware threads
• Thread-safe libraries with scalable concurrency support for parallel programming
Garbage Collector
• Manages memory on behalf of the application
• Must balance throughput against observed pauses
• Exploits multiple hardware threads
Configurable Garbage Collection policies
• Multiple policies to match varying user requirements
– Pause time, throughput, memory footprint, and GC overhead
• All modes exploit parallel execution
– Dynamic adaptation to the number of available hardware cores and threads
– GC scalability independent of user application scalability
– Very low overhead (<3%) on typical workloads
How do GC policies compare? - optthruput
[Figure: timeline of Thread 1 … Thread n, showing Java execution and GC phases]
Optimize Throughput
• Highly parallel GC + streamlined application thread execution
• May cause longer pause times
-Xgcpolicy:optthruput
Picture is only illustrative and doesn’t reflect any particular real-life application. The purpose is to show theoretical differences in pause times between GC policies.
How do GC policies compare? - optavgpause
[Figure: timeline of Thread 1 … Thread n, showing Java execution, GC, and concurrent tracing]
Optimize Pause Time
• GC cleans up concurrently with application thread execution
• Sacrifices some throughput to reduce average pause times
-Xgcpolicy:optavgpause
How do GC policies compare? - gencon
[Figure: timeline of Thread 1 … Thread n, showing Java execution, scavenge GC, global GC, and concurrent tracing]
Balanced
• Cleans up many short-lived objects with frequent, short scavenges, while concurrent tracing of longer-lived objects runs alongside application threads
• Some pauses are needed to collect longer-lived objects
-Xgcpolicy:gencon
How do GC policies compare? - subpools
• Uses multiple free lists
• Tries to predict the size of future allocation requests based on earlier allocation requests
• Recreates free lists at the end of each GC based on these predictions
• While allocating objects on the heap, chooses free chunks using a “best fit” method, as opposed to the “first fit” method used in other algorithms
• Concurrent marking is disabled
Scalable
• Scalable GC focused on larger multiprocessor machines
• Improved object allocation algorithm
• May not be appropriate for small-to-midsize configurations
-Xgcpolicy:subpool
JVM optimizations for multi-core scalability
Lock removal across JVM and class libraries
java.util.concurrent package optimizations
Better working set for cache efficiency
Stack allocation
Remove/optimize synchronization
Thread local storage for send/receive buffers
Non-blocking containers
Asynchronous JIT compilation on a separate thread
Right-sized application runtimes
Thank You
Questions?
Email: [email protected]
http://www.ibm.com/developerworks/java/
Special notices
© IBM Corporation 2010. All Rights Reserved.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
The following are trademarks of the International Business Machines Corporation in the United States and/or other countries (ibm.com/legal/copytrade.shtml): AIX, CICS, CICSPlex, DataPower, DB2, DB2 Universal Database, i5/OS, IBM, the IBM logo, IMS/ESA, Power Systems, Lotus, OMEGAMON, OS/390, Parallel Sysplex, pureXML, Rational, Redbooks, Sametime, SMART SOA, System z, Tivoli, WebSphere, and z/OS.
A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.
Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.