Low latency Java apps


Description

A presentation on how automatic memory management and adaptive compilation impact the latency of applications. Includes some ideas on how to minimise these effects.

Transcript of Low latency Java apps

Page 1: Low latency Java apps
Page 2: Low latency Java apps


Understanding the Java Virtual Machine and Low Latency Applications
Simon Ritter, Technology Evangelist

Page 3: Low latency Java apps

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Page 4: Low latency Java apps

People Want Fast Applications

•  What does fast mean? •  “I want the answer as fast as possible” •  “I want all the answers as fast as possible”

•  These two goals are somewhat orthogonal from a programming perspective

•  One fast answer = Low Latency •  All answers as fast as possible = High Throughput

Page 5: Low latency Java apps

The Java Virtual Machine

•  It’s virtual, not physical •  Conversion from bytecodes to native instructions and library calls •  Interpreted mode and Just In Time (JIT) compilation

•  Automatic memory management • new operator allocates space for an object •  Garbage collector eliminates the need for programmatic ‘free’ •  No explicit pointer manipulation (much safer)

•  Multi-threaded •  Each object has a single monitor •  Programmatic locking, some automated unlocking
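A minimal sketch of these points (the `Counter` class is hypothetical): every object can serve as a monitor, `synchronized` acquires it programmatically, and the JVM releases it automatically when the block exits, even on an exception.

```java
public class Counter {
    private long count = 0;
    private final Object lock = new Object(); // any object can act as a monitor

    public void increment() {
        synchronized (lock) {   // acquire the object's monitor
            count++;
        }                       // released automatically here
    }

    public long get() {
        synchronized (lock) {
            return count;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) c.increment(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) c.increment(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.get()); // 20000
    }
}
```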

Performance Considerations

Page 6: Low latency Java apps

Application Stack

Java Application

Application Server (Optional)

Java Virtual Machine

Operating System

Hardware (CPU/Memory/Bus)

Impact that tuning changes will have

Page 7: Low latency Java apps

Memory Management

Page 8: Low latency Java apps

What You Need To Know About GC: HotSpot VM Heap Layout

(PermGen: removal planned for JDK 8)

Page 9: Low latency Java apps

What You Need To Know About GC: HotSpot VM Heap Layout

Page 10: Low latency Java apps

Important Concepts of GC

•  Frequency of minor GCs is dictated by: •  Rate of object allocation •  Size of the Eden space

•  Frequency of object promotion into tenured space is dictated by: •  Frequency of minor GCs (how quickly objects age) •  Size of the survivor spaces •  Tenuring threshold (default 7)

•  Ideally as little data as possible should be promoted •  Promotion involves copying, which is expensive and must be done stop-the-world

Page 11: Low latency Java apps

Important Concepts of GC

•  Object retention impacts latency more than object allocation •  GC only visits live objects •  GC time is a function of the number of live objects and graph complexity

•  Object allocation is very cheap •  ~10 cycles in common case •  Compare to 30 cycles for fastest malloc algorithm

•  Reclaiming short lived objects is very cheap •  Weak generational hypothesis

Page 12: Low latency Java apps

Quick Rules of Thumb

•  Don’t be afraid to allocate quickly-disposed-of objects •  Especially for intermediate results

•  GC loves small immutable objects and short-lived objects •  So long as they don’t survive a minor GC

•  Try to avoid complex inter-object relationships •  Reduce complexity of graph to be analysed by GC
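A sketch of the kind of object the GC likes (the `Price` class is hypothetical): small, immutable, and typically short-lived, with operations that return new instances rather than mutating state.

```java
// Small immutable value object: cheap to allocate (~10 cycles on the
// fast path) and, if it dies before a minor GC, nearly free to reclaim.
public final class Price {
    private final long cents;

    private Price(long cents) {
        this.cents = cents;
    }

    public static Price of(long cents) {
        return new Price(cents);
    }

    // Returns a new object instead of mutating; the old one can die young
    public Price plus(Price other) {
        return new Price(cents + other.cents);
    }

    public long cents() {
        return cents;
    }
}
```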

Page 13: Low latency Java apps

Quick Rules of Thumb However…

•  Don’t allocate objects needlessly •  More frequent allocations means more frequent GCs •  More frequent GCs implies faster object aging •  Faster object aging means faster promotion to old generation •  Which means more frequent concurrent collections or full compacting collections of the old generation

•  It is better to use short-lived immutable objects than long-lived mutable objects

Page 14: Low latency Java apps

The Ideal GC Scenario

•  After application initialization phase, only experience minor GCs and old generation growth is negligible •  Ideally, never experience need for Old Generation collection •  Minor GCs are [generally] the fastest

•  Start with Parallel GC •  i.e. -XX:+UseParallelOldGC or -XX:+UseParallelGC •  Parallel GC offers the fastest minor GC times

•  So long as you have multiple cores/CPUs

•  Move to CMS if Old Generation collection is needed •  Minor GC times will be slower due to promotion into free lists •  Hopefully this will avoid full compacting collection of old gen.

Page 15: Low latency Java apps

Concurrent GC: An Interesting Aside

•  Concurrent collectors require a write barrier to track potential hidden live objects •  The write barrier tracks all writes to objects and records the creation and removal of references between objects

•  Write barriers introduce performance overhead •  Size of overhead depends on implementation

•  Stop-the-world GC does not require a write barrier •  Hence, the ideal situation is: •  Use Parallel GC or ParallelOld GC and avoid Old Gen. collection •  Thus avoiding full GC

Page 16: Low latency Java apps

GC Friendly Programming (1)

•  Large objects •  Expensive to allocate (may not use fast path, straight into Old Gen.) •  Expensive to initialise (Java spec. requires zeroing)

•  Large objects of different sizes can cause heap fragmentation •  For non-compacting or partially compacting GCs

•  Avoid large object allocations (if you can) •  Especially frequent large object allocations during application “steady state” •  Not so bad during application warm-up (pooling)

Page 17: Low latency Java apps

GC Friendly Programming (2)

•  Data structure resizing •  Avoid resizing of array-backed “container objects” •  Use the constructor that takes an explicit size parameter

•  Resizing leads to unnecessary object allocation •  Can also contribute to fragmentation (non-compacting GC)

•  Object pooling issues •  Contributes to live objects visited during GC •  GC pause is a function of the number of live objects •  Access to the pool requires locking •  Scalability issue •  Weigh against benefits of large object allocation at start-up
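The resizing advice above can be sketched as follows (the `Presized` helper is hypothetical): when the final size is known, pass it to the constructor so the backing array never has to be grown and no discarded arrays are left for the GC to reclaim.

```java
import java.util.ArrayList;
import java.util.List;

public class Presized {
    public static List<Integer> squares(int n) {
        // Pre-sized backing array: no intermediate re-allocations or
        // copies, and no garbage arrays created while filling the list
        List<Integer> result = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            result.add(i * i);
        }
        return result;
    }
}
```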

Page 18: Low latency Java apps

GC Friendly Programming (3)

•  Finalizers •  Simple rule: DON’T USE THEM! •  Unless you really, really, really (and I mean REALLY) have to •  Requires at least 2 GC cycles and those GC cycles are slower •  Use a method to explicitly free resources and manage this manually before the object is no longer required
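The explicit-release advice can be sketched with `AutoCloseable` and try-with-resources (the `NativeBuffer` class is hypothetical):

```java
public class NativeBuffer implements AutoCloseable {
    private boolean released = false;

    @Override
    public void close() {           // explicit, deterministic release
        if (!released) {
            released = true;        // free the underlying resource here
        }
    }

    public boolean isReleased() {
        return released;
    }

    public static void main(String[] args) {
        // try-with-resources calls close() as soon as the block exits,
        // with no GC or finalizer involvement
        try (NativeBuffer buf = new NativeBuffer()) {
            // use buf ...
        }
    }
}
```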

•  Reference objects •  Possible alternative to finalizers (as an almost last resort)

• SoftReference important note •  Referent is cleared by the GC; how aggressively it is cleared is at the mercy of the GC's implementation •  This “aggressiveness” dictates the degree of object retention
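A minimal sketch of the SoftReference pattern (the `SoftCache` class is hypothetical), showing why callers must always be prepared for the referent to have been cleared:

```java
import java.lang.ref.SoftReference;

public class SoftCache {
    // The GC may clear the referent under memory pressure, at a point
    // entirely of its choosing, so get() must tolerate a null referent
    private SoftReference<byte[]> cached = new SoftReference<>(null);

    public byte[] get() {
        byte[] data = cached.get();
        if (data == null) {                    // cleared, or never loaded
            data = new byte[1024];             // rebuild the expensive data
            cached = new SoftReference<>(data);
        }
        return data;
    }
}
```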

Page 19: Low latency Java apps

Subtle Object Retention (1)

•  Consider this class:

class MyImpl extends ClassWithFinalizer {
    private byte[] buffer = new byte[1024 * 1024 * 2];
    ....

•  Consequences of finalizer in super-class •  At least 2 GC cycles to free the byte array

•  One solution: use composition instead of inheritance

class MyImpl {
    private ClassWithFinalizer classWithFinalizer;
    private byte[] buffer = new byte[1024 * 1024 * 2];
    ....

Page 20: Low latency Java apps

Subtle Object Retention (2)

•  Inner classes •  Have an implicit reference to the outer instance •  Can potentially increase object retention and graph complexity

•  Net effect is the potential for increased GC duration •  Thus increased latency
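A sketch of the difference (class names hypothetical): a non-static inner class retains its outer instance; a static nested class does not, so it neither extends object lifetimes nor deepens the graph.

```java
public class Outer {
    private final byte[] big = new byte[1024 * 1024]; // outer state

    // Non-static inner class: carries an implicit reference to the Outer
    // instance, so a long-lived Retained keeps 'big' alive too
    class Retained {
        int outerSize() {
            return big.length; // can reach outer state via the hidden ref
        }
    }

    // Static nested class: no implicit reference, so handing it to a
    // long-lived holder does not retain the Outer instance
    static class Detached {
        int size() {
            return 0;
        }
    }

    Retained newRetained() {
        return new Retained();
    }
}
```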

Page 21: Low latency Java apps

Garbage First (G1) Garbage Collection

•  Known limitations in current GC algorithms •  CMS: No compaction, need for a remark phase •  ParallelOld: Full heap compaction, potentially long STW pauses •  Pause target can be set, but is best-effort, no guarantees •  Problems arise with increases in heap, throughput and live set

•  G1 Collector •  Detlefs, Flood, Heller, Printezis - 2004

Page 22: Low latency Java apps

G1 Collector

•  CMS Replacement (available JRE 7 u4 onwards) •  Server “Style” Garbage Collector •  Parallel •  Concurrent •  Generational •  Good Throughput •  Compacting •  Improved ease-of-use •  Predictable (though not hard real-time)

Main differences between CMS and G1

Page 23: Low latency Java apps

G1 Collector: High Level Overview

•  Region based heap •  Dynamic young generation sizing •  Partial compaction using evacuation

•  Snapshot At The Beginning (SATB) •  Avoids remark phase

•  Pause target •  Select number of regions in young and mixed collections that fits the target

•  Garbage First •  Select regions that contain mostly garbage •  Minimal work for maximal return

Page 24: Low latency Java apps

Colour Key

Young Generation

Old Generation

Recently Copied in Young Generation

Recently Copied in Old Generation

Non-Allocated Space

Page 25: Low latency Java apps

Young GCs in CMS

•  Young generation, split into •  Eden •  Survivor spaces

•  Old generation •  In-place de-allocation •  Managed by free lists


Page 26: Low latency Java apps

Young GCs in CMS

•  End of young generation GC


Page 27: Low latency Java apps

Young GCs in G1

•  Heap split into regions •  Currently 1MB regions

•  Young generation •  A set of regions

•  Old generation •  A set of regions


Page 28: Low latency Java apps

Young GCs in G1

•  During a young generation GC •  Survivors from the young regions are evacuated to: •  Survivor regions •  Old regions


Page 29: Low latency Java apps

Young GCs in G1

•  End of young generation GC

Page 30: Low latency Java apps

Old GCs in CMS (Sweeping After Marking)

•  Concurrent marking phase •  Two stop-the-world pauses

•  Initial mark •  Remark

•  Marks reachable (live) objects

•  Unmarked objects are deduced to be unreachable (dead)


Page 31: Low latency Java apps

Old GCs in CMS (Sweeping After Marking)

•  End of concurrent sweeping phase •  All unmarked objects are de-allocated


Page 32: Low latency Java apps

Old GCs in G1 (After Marking)

•  Concurrent marking phase •  One stop-the-world pause •  Remark •  (Initial mark piggybacked on an evacuation pause) •  Calculates liveness information per region •  Empty regions can be reclaimed immediately


Page 33: Low latency Java apps

Old GCs in G1 (After Marking)

•  End of remark phase

Page 34: Low latency Java apps

Old GCs in G1 (After Marking)

•  Reclaiming old regions •  Pick regions with low live ratio •  Collect them piggy-backed on young GCs •  Only a few old regions collected per such GC


Page 35: Low latency Java apps

Old GCs in G1 (After Marking)

•  We might leave some garbage objects in the heap •  In regions with very high live ratio •  We might collect them later


Page 36: Low latency Java apps

CMS vs. G1 Comparison


Page 37: Low latency Java apps

Latency Is A Key Goal

•  Oracle actively researching new ways to reduce latency and make it more predictable

•  Which direction this work goes in needs to be driven by requirements

Page 38: Low latency Java apps

Adaptive Compilation

Page 39: Low latency Java apps

JIT Compilation Facts: Optimisation Decisions

•  Data: classes loaded and code paths executed •  JIT compiler does not know about all code in application

•  Unlike traditional compiler •  Optimisation decisions based on runtime history

•  No potential to predict future profile •  Decisions made may turn out to be sub-optimal later

•  Limits some types of optimisations used •  As profile changes JIT needs to react

•  Throw away compiled code no longer required •  Re-optimise based on new profile

Page 40: Low latency Java apps

JIT Compilation Facts: Internal Profiling

•  Need to determine which methods are hot or cold •  Invocation counting

•  Handled by bytecode interpreter or including an add instruction to native code

•  Can have noticeable run-time overhead •  Thread sampling •  Periodically sample threads, recording instruction pointers •  Minimising application disruption requires a custom thread implementation or OS support

•  Hardware based sampling •  Platform specific instrumentation mechanisms

Page 41: Low latency Java apps

JIT Compilation Facts: JIT Assumptions

•  Methods will probably not be overridden •  Can be called with a fixed address

•  A float will probably never be NaN •  Use hardware instructions rather than floating point library

•  Exceptions will probably not be thrown in a try block •  All catch blocks are marked as cold

•  A lock will probably not be saturated •  Start as a fast spinlock

•  A lock will probably be taken and released by the same thread •  Sequential unlock/acquire operations can be treated as a no-op

Page 42: Low latency Java apps

Inlining and Virtualisation: Competing Forces

•  Most effective optimisation is method inlining •  Virtual method calls are the biggest barrier to this •  Good news:

•  JIT can de-virtualize methods if it only sees 1 implementation •  Makes it a mono-morphic call

•  Bad news: •  If the JIT compiler later discovers an additional implementation it must de-optimise •  Re-optimise to make it a bi-morphic call •  Reduced performance, especially if extended to a third implementation and a multi-morphic call
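A minimal sketch of such a call site (the `Dispatch` example is hypothetical): while `Doubler` is the only loaded `Handler` implementation, the call in `process()` is mono-morphic and the JIT can de-virtualise and inline it; loading a second implementation later forces de-optimisation.

```java
public class Dispatch {
    public interface Handler {
        int handle(int x);
    }

    // The single implementation the JIT sees at compile time
    public static final class Doubler implements Handler {
        @Override
        public int handle(int x) {
            return x * 2;
        }
    }

    public static int process(Handler h, int x) {
        // Virtual call site: with one implementation loaded, the JIT can
        // replace the dispatch with a direct (and inlinable) call
        return h.handle(x);
    }
}
```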

Page 43: Low latency Java apps

Inlining and Virtualisation: Important Points

•  Implementation changes during “steady-state” •  Will slow down application

•  Write JIT friendly code? •  No! Remember, “Beware premature optimisation”

•  What to do? •  Code naturally and let the JIT figure it out •  Profile to find problem areas •  Modify code only for problem areas to improve performance

Page 44: Low latency Java apps

Conclusions

•  Java uses a virtual machine, so: •  Has automatic memory management •  Has adaptive compilation of bytecodes

•  How these features work will have a significant impact on the performance of your application

•  Profile, profile, profile! •  Avoid premature optimisation

Page 45: Low latency Java apps

Resources

•  www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

•  www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

•  middlewaremagic.com/weblogic/?p=6930

Page 46: Low latency Java apps