Understanding and tuning Garbage Collection for … · 2011. 11. 29. · Understanding and tuning...
GC Tuning
Understanding and tuning Garbage Collection for Bankfusion Universal Banking
Action | Name | Date | Notes | Version
Create | Shainesh Baheti | 04 Jan 2011 | White paper on BFUB GC tuning | 0.1
Contents
What's in this?
Basic understanding of GC
  Verbose GC logs and its analysis
    Location to get verbose GC logs:
    Analyzer
Why BankfusionUB needs optimal GC?
Test case and Server specification
  Test case
  Server specification
Description of current GC settings and problems associated with that.
  Default GC settings
    Results
  Less Initial memory (Keeping Initial memory 1/4th of maximum heap)
    Reasons
    Solution
    Results
    Observations with Change settings
  Which is best GC policy for BFUB?
    Reasons
    Solution
  Optavgpause
    Result
    Observations with Change settings:
  Gencon
    Results
    Observations with Change settings
  Subpool
    Results
    Limitation for Subpool
  Tuning gencon policy
    Reason
    Solution
    Results
    Observations with Change settings
  Remove explicit GC
    Reasons
    Solution
  Why not test with full M&D (Teller with ATM, BPW, Lending, collateral and core) and see the impact?
    Reasons
    Solution
  Why not run full EOD and see the impact?
Still Grey areas in Bankfusion UB
  Finalizer
  LOA
There are no dumb questions in BankfusionUB
Conclusion/Best Practices
Sources
What's in this?
Just garbage... garbage and more garbage... all garbage!
But you will learn why garbage is so important for the BFUB application!!!
Clearing of Garbage!!!
Basic understanding of GC
I don’t want to go into the details of GC, memory management and all that blah blah. But to make the results easier to understand, here is a quick refresher.
Conceptually, garbage collection (GC) creates the illusion of infinite free space.
– Java has a create (“new”) but no destroy
– Applications create objects as needed on the Heap
In reality, GC reclaims unused memory back to the free lists
– Finds objects that are no longer used
– Makes their storage available for allocation
All garbage collectors follow the same formula
• Find all live objects (Mark)
– Trace the object graph from a set of known starting points (e.g., thread stacks), known as “The Root Set”
• Recycle objects not found onto the free list (Sweep)
– Objects not visible in the live set are “dead”
• Optional: Move objects to reduce fragmentation (Compact)
– Free bits of memory here and there create holes
– An object cannot be allocated if no single hole is big enough, even when the total free space is sufficient
– Converts many small holes into fewer large ones
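To make the mark and sweep steps concrete, here is a minimal Java sketch. It is a toy model, not how the IBM JVM works internally: the class ToyMarkSweep and its Obj nodes are hypothetical, references are plain lists, the root set is passed in explicitly, and compaction is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the mark and sweep phases (hypothetical classes, not a JVM's internals).
class ToyMarkSweep {

    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();   // outgoing references
        Obj(String name) { this.name = name; }
    }

    // Mark: trace the object graph from the root set; everything reached is live.
    static Set<Obj> mark(List<Obj> roots) {
        Set<Obj> live = new HashSet<>();            // identity-based, Obj has no equals()
        Deque<Obj> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Obj o = work.pop();
            if (live.add(o)) {                      // first visit: follow its references
                work.addAll(o.refs);
            }
        }
        return live;
    }

    // Sweep: objects not in the live set are dead; remove them from the heap
    // and return them as the "free list".
    static List<Obj> sweep(List<Obj> heap, Set<Obj> live) {
        List<Obj> freeList = new ArrayList<>();
        heap.removeIf(o -> {
            boolean dead = !live.contains(o);
            if (dead) {
                freeList.add(o);
            }
            return dead;
        });
        return freeList;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), d = new Obj("d");
        a.refs.add(b);                              // a -> b: reachable from root a
        c.refs.add(d);                              // c -> d: an unreachable island
        List<Obj> heap = new ArrayList<>(Arrays.asList(a, b, c, d));
        for (Obj o : sweep(heap, mark(Arrays.asList(a)))) {
            System.out.println("reclaimed " + o.name);
        }
    }
}
```

Note that the unreachable pair c and d is reclaimed even though c still references d: reachability from the roots is what counts, not whether an object is referenced at all.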
IBM Java GC has a number of selectable policies under which it will recycle objects
Why have many policies? Why not just “the best”?
– The JVM cannot always determine dynamically which tradeoffs the user/application is willing to make
• Pause time vs. Throughput
• Footprint vs. Frequency
This is why we are tuning GC for the BFUB application.
Verbose GC logs and its analysis
Location to get verbose GC logs:
In the WebSphere administrative console, go to Application servers > server1 > Process definition > Java Virtual Machine and select Verbose garbage collection.
Alternatively, add the -verbose:gc option to the Generic JVM arguments on the same page. -verbose:gc is the main diagnostic available for runtime analysis of the garbage collector.
The output goes to the native_stderr.log file in the following location:
/ibm/WebSphere7/AppServer/profiles/AppSrv01/logs/server1
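Besides -verbose:gc, a running JVM exposes cumulative GC counts and times through the standard java.lang.management API. The sketch below (GcStats is a hypothetical class name) prints them. Collector names differ between IBM and Sun/Oracle JVMs, and these running totals are no substitute for the per-collection detail in native_stderr.log.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the cumulative collection count and time of every collector this JVM exposes.
// Collector names are JVM-specific (IBM and Sun/Oracle JVMs report different names).
class GcStats {

    // Sum of accumulated collection time in milliseconds; collectors reporting -1
    // ("not available") are skipped.
    static long totalCollectionTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time: " + totalCollectionTimeMs() + " ms");
    }
}
```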
Analyzer
For the analysis, we can use following tools from IBM:
• IBM Support Assistant - Garbage Collection and Memory Visualizer
• GC Analyzer – ga402
Please don't feed Sun JDK logs to the IBM GC analyzers, or vice versa. Why? The verbose GC log formats of the two JVMs are different.
Also, please don’t ask why I am using only IBM tools!!!
Why BankfusionUB needs optimal GC?
• Where there is Java, there is garbage. So how will BFUB, a high-throughput application, survive it?
• Although IBM ships well-tuned, intelligent GC defaults, it still leaves room for applications like BFUB to optimize further.
• Don't we want Tier 1 banks!!!!
Always remember our aim!!
• Reduce GC pause time overhead as much as possible.
• Improve the performance of BFUB and make it scalable.
How reducing GC pause time leads to improved performance, we will see in this document!!
Test case and Server specification
We need some BFUB application modules to confirm the best GC settings!!!
Test case
Test case | Description
Online | Teller module for 75 users
Batch | Interest Accrual and Interest Accrual Posting batch processes
Also, the server specification matters for GC settings!!
Server specification
Server type | Application | Configuration
App server | WAS 7 | 8 cores, 16 GB RAM, IBM 9133-55A, POWER5, 1.65 GHz, 64-bit
DB server | DB2 9.7 | 8 cores, 16 GB RAM, IBM 9133-55A, POWER5, 1.65 GHz, 64-bit
Description of current GC settings and problems associated with that.
The current setting is the default: the initial heap size is one quarter (1/4) of the maximum heap size (512 MB of 2048 MB, labelled Min512Max2048 in the results), and the GC policy is optthruput.
Default GC settings
Policy | Option | Description
Optimize for throughput | -Xgcpolicy:optthruput (optional) | The default policy. It is typically used for applications where raw throughput is more important than short GC pauses. The application is stopped each time that garbage is collected.
It works as shown in the figure below:
Results
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 517 | 1399 | 19 | 41 | 79.2 | 62.9 | Baseline
Batch
Test type | GC time (sec) | Collections | Accrual (h:mm:ss) | Posting (h:mm:ss) | Avg. GC overhead (%) | Max GC overhead (%)
Min512Max2048-optthruput [baseline] | 85 | 245 | 0:10:33 | 0:03:21 | 6 | 90
Problem Statement 1
Less Initial memory (Keeping Initial memory 1/4th of maximum heap)
Reasons
- The heap is expanded and contracted during GC, and this resizing adds to GC pause time.
- Compaction occurs if the heap is too small or fragmented, or if the heap is resized, and that also adds to GC pause time.
- More collections are observed because GC runs frequently, and the overall average overhead is 19% in online and 6% in batch, which is too high.
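The overhead percentages quoted here can be read as the share of elapsed time spent in GC pauses. A minimal sketch, assuming exactly that definition (GCMV and ga402 may compute overhead per interval, so their numbers need not match this formula); the class name GcOverhead and the sample pauses are hypothetical:

```java
// Sketch: GC overhead as the percentage of elapsed (wall-clock) time spent in GC pauses.
// Assumption: overhead = sum of pause times / elapsed time.
class GcOverhead {

    // pausesMs: duration of each GC pause in ms; elapsedMs: length of the measured run in ms.
    static double overheadPercent(long[] pausesMs, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        long totalPauseMs = 0;
        for (long p : pausesMs) {
            totalPauseMs += p;
        }
        return 100.0 * totalPauseMs / elapsedMs;
    }

    public static void main(String[] args) {
        // Hypothetical pauses within a 10-second window, not data from the tests in this paper.
        long[] pausesMs = {120, 80, 300};
        System.out.println("GC overhead (%): " + overheadPercent(pausesMs, 10_000));
    }
}
```

Under this definition, fewer or shorter pauses in the same elapsed time directly lower the overhead, which is what the following settings aim for.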
Solution
Keep the initial heap size equal to the maximum heap size (Min=Max2048, i.e. 2048 MB for both).
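One way to confirm from inside the JVM that the initial heap equals the maximum heap is to compare Runtime.totalMemory() (the committed heap) with Runtime.maxMemory() (the -Xmx limit). A sketch under that assumption; HeapSizing is a hypothetical name, and on some JVMs the two values can differ slightly even when -Xms equals -Xmx:

```java
// Sketch: check whether the heap can still grow (initial < maximum) or is fixed (initial == maximum).
class HeapSizing {

    static final long MB = 1024L * 1024L;

    // True when the committed heap already equals a real (non-unlimited) maximum.
    static boolean heapIsFixedSize(long totalBytes, long maxBytes) {
        return maxBytes != Long.MAX_VALUE && totalBytes == maxBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();   // heap currently committed to the JVM
        long max = rt.maxMemory();       // upper limit; Long.MAX_VALUE if there is none
        long free = rt.freeMemory();     // unused part of the committed heap
        System.out.printf("committed=%d MB, max=%d MB, free=%d MB%n",
                total / MB, max / MB, free / MB);
        System.out.println(heapIsFixedSize(total, max)
                ? "initial == maximum: no expansion/contraction expected"
                : "heap can still be resized (initial < maximum)");
    }
}
```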
Results
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 517 | 1399 | 19 | 41 | 79.2 | 62.9 | Baseline
Min=Max2048-optthruput | 283 | 777 | 11 | 69 | 75.9 | 62.5 | 6.94
Batch
Test type | GC time (sec) | Collections | Accrual (h:mm:ss) | Posting (h:mm:ss) | Avg. GC overhead (%) | Max GC overhead (%)
Min512Max2048-optthruput [baseline] | 85 | 245 | 0:10:33 | 0:03:21 | 6 | 90
Min=Max2048-optthruput | 29 | 79 | 0:10:22 | 0:03:10 | 3 | 87
Observations with Change settings
- No contraction/expansion, so no time is wasted on heap resizing.
- No time is wasted on compaction, because no resizing is needed and a contiguous heap is available throughout the test.
- Fewer collections were observed (777 vs. 1399 online, 79 vs. 245 batch), and the overall average overhead dropped to 11% in online and 3% in batch.
- Around 7% improvement in response time.
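The "% improvement in response time" column appears to compare each run with the baseline. Assuming the usual formula (baseline - new) / baseline x 100, a positive value means faster and a negative value means slower; -593% then corresponds to roughly 6.9 times the baseline response time. A small sketch with a hypothetical class name:

```java
// Sketch of the assumed formula behind "% improvement in response time".
class ResponseTimeImprovement {

    // Positive = faster than baseline, negative = slower.
    static double improvementPercent(double baselineMs, double newMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (baselineMs - newMs) / baselineMs * 100.0;
    }

    // Inverse: the response time implied by an improvement, as a multiple of the baseline.
    static double ratioToBaseline(double improvementPercent) {
        return 1.0 - improvementPercent / 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical response times, not measurements from this paper.
        System.out.println(improvementPercent(1000, 930));
        // The optavgpause row: about 6.93 times the baseline response time.
        System.out.println(ratioToBaseline(-593.36));
    }
}
```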
Although we solved one problem, BFUB still spends 11% of its time in GC in the online test, so it is still in the red-hot zone. Now let’s take the next important aspect, the “GC policy”, which is simply a different way of doing GC.
Problem Statement 2
Which is best GC policy for BFUB?
Reasons
• GC overhead is around 19%, which is too high, so we cannot just rely on the default policy optimized for throughput (optthruput).
• We need a GC policy that can use concurrent GC and can distinguish between short-lived and long-lived objects.
• Any further performance improvement has to come from reducing GC activity.
Solution
IBM provides several GC policies to address our concern. Let’s test them one after the other!!!
Optavgpause (should never be used for BFUB)
Policy | Option | Description
Optimize for pause time | -Xgcpolicy:optavgpause | Trades high throughput for shorter GC pauses by performing some of the garbage collection concurrently. The application is paused for shorter periods.
optavgpause is an alternative GC policy designed to keep pauses to a minimum. It does not guarantee a
particular pause time, but pauses are shorter than those produced by the default GC policy. The idea is
to perform some garbage collection work concurrently while the application is running. This is done in
two places:
• Concurrent mark and sweep: Before the heap is filled up, each mutator helps out and marks objects (concurrent mark). There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the mutator threads help out and sweep objects (concurrent sweep).
• Background GC thread: One (or more) low-priority background GC threads perform marking
while the application is idle.
Result
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 517 | 1399 | 19 | 41 | 79.2 | 62.9 | Baseline
Min=Max2048-optthruput | 283 | 777 | 11 | 69 | 75.9 | 62.5 | 6.94
Min=Max2048-optavgpause | 513 | 513 | 0 | 13 | 94.4 | 85.7 | -593.36
Batch
Test type | GC time (sec) | Collections | Accrual (h:mm:ss) | Posting (h:mm:ss) | Avg. GC overhead (%) | Max GC overhead (%)
Min512Max2048-optthruput [baseline] | 85 | 245 | 0:10:33 | 0:03:21 | 6 | 90
Min=Max2048-optthruput | 29 | 79 | 0:10:22 | 0:03:10 | 3 | 87
Min=Max2048-optavgpause | 11 | 84 | 0:10:11 | 0:03:18 | 1 | 75
Observations with Change settings:
As IBM says, some throughput degradation (around 5–10%) is expected, because GC runs concurrently with the application threads (concurrent mark and sweep).
But BFUB degrades far more than IBM suggests: response time is roughly 600% worse than the baseline (-593% in the table).
We still need to investigate why response time degrades that much. A likely reason is that the concurrent mark and sweep threads use more CPU (average CPU rose from 62.5% to 85.7%), leaving fewer resources for the BFUB application.
So, for now, a permanent goodbye to the optavgpause GC policy for the BFUB application.
Gencon (The best!!)
Policy | Option | Description
Generational concurrent | -Xgcpolicy:gencon | Handles short-lived objects differently than objects that are long-lived. Applications that have many short-lived objects can see shorter pause times with this policy while still producing good throughput.
A generational garbage collection strategy considers the lifetime of objects and places them in separate
areas of the heap. In this way, it tries to overcome the drawbacks of a single heap in applications where
most objects die young -- that is, where they do not survive many garbage collections.
With generational GC, objects that tend to survive for a long time are treated differently from short-
lived objects. The heap is split into a nursery and a tenured area, as illustrated in Figure 4. Objects are
created in the nursery and, if they live long enough, are promoted to the tenured area. Objects are
promoted after having survived a certain number of garbage collections. The idea is that most objects
are short-lived; by collecting the nursery frequently, these objects can be freed up without paying the
cost of collecting the entire heap. The tenured area is garbage collected less often.
As you can see in the figure, the nursery is in turn split into two spaces: allocate and survivor. Objects are allocated into the allocate space and, when that fills up, live objects are copied into the survivor space or into the tenured space, depending on their age. The spaces in the nursery then switch use, with allocate becoming survivor and survivor becoming allocate. The space occupied by dead objects can simply be overwritten by new allocations. Nursery collection is called a scavenge; the following figure illustrates what happens during this process:
When the allocate space is full, garbage collection is triggered. Live objects are then traced and copied
into the survivor space. This process is really inexpensive if most of the objects are dead. Furthermore,
objects that have reached a copy threshold count are promoted into the tenured space. The object is
then said to be tenured.
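The allocate/survivor flip and the copy-count threshold can be modelled in a few lines of Java. This is a toy illustration, not the IBM implementation: ToyScavenge and its fields are hypothetical, liveness is supplied as a predicate instead of being traced from the roots (and, in a real JVM, the remembered set), and objects are moved between lists rather than copied in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a gencon-style nursery scavenge (hypothetical, not the IBM JVM's code).
class ToyScavenge {

    static final class Obj {
        final String name;
        int age;                                   // scavenges survived so far
        Obj(String name) { this.name = name; }
    }

    List<Obj> allocateSpace = new ArrayList<>();
    List<Obj> survivorSpace = new ArrayList<>();
    final List<Obj> tenured = new ArrayList<>();
    final int tenureAge;                           // copy-count threshold for promotion

    ToyScavenge(int tenureAge) { this.tenureAge = tenureAge; }

    Obj allocate(String name) {
        Obj o = new Obj(name);
        allocateSpace.add(o);                      // new objects start in the nursery
        return o;
    }

    // isLive stands in for tracing from the roots.
    void scavenge(Predicate<Obj> isLive) {
        for (Obj o : allocateSpace) {
            if (!isLive.test(o)) {
                continue;                          // dead objects are never copied: no work
            }
            o.age++;
            if (o.age >= tenureAge) {
                tenured.add(o);                    // reached the threshold: promote (tenure)
            } else {
                survivorSpace.add(o);              // copy into the survivor space
            }
        }
        // Flip roles: survivor becomes the new allocate space; the old allocate space
        // is emptied in one step, which is how the dead objects are "freed".
        List<Obj> oldAllocate = allocateSpace;
        allocateSpace = survivorSpace;
        survivorSpace = oldAllocate;
        survivorSpace.clear();
    }

    public static void main(String[] args) {
        ToyScavenge heap = new ToyScavenge(2);
        Obj session = heap.allocate("session");    // long-lived
        for (int i = 0; i < 5; i++) {
            heap.allocate("temp" + i);             // short-lived
        }
        heap.scavenge(o -> o == session);
        heap.allocate("temp5");
        heap.scavenge(o -> o == session);
        System.out.println("nursery=" + heap.allocateSpace.size() + ", tenured=" + heap.tenured.size());
    }
}
```

The cost of each scavenge is proportional to the survivors, not to the garbage, which is why the policy suits workloads where most objects die young.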
As the name Generational concurrent implies, the gencon policy has a concurrent aspect to it. The
tenured space is concurrently marked with an approach similar to the one used in the optavgpause
policy, except without concurrent sweep. All allocations pay a small throughput tax during the
concurrent phase. With this approach, the pause time incurred from the tenured space collections is kept small.
The following figure shows how the execution time maps out when running gencon GC:
[Figure: Distribution of CPU time between mutators and GC threads in gencon]
A scavenge is short (shown by the small red boxes). Gray indicates that concurrent tracing starts
followed by a collection of the tenured space, some of which happens concurrently. This is called a
global collection, and it includes both a scavenge and a tenure space collection. How often a global
collection occurs depends on the heap sizes and object lifetimes. The tenured space collection should be
relatively quick because most of it has been collected concurrently.
Results
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 517 | 1399 | 19 | 41 | 79.2 | 62.9 | Baseline
Min=Max2048-optthruput | 283 | 777 | 11 | 69 | 75.9 | 62.5 | 6.94
Min=Max2048-optavgpause | 513 | 513 | 0 | 13 | 94.4 | 85.7 | -593.36
Min=Max2048-gencon | 155 | 2206 | 6 | 100 | 76.2 | 50.9 | 39.20
Batch
Test type | GC time (sec) | Collections | Accrual (h:mm:ss) | Posting (h:mm:ss) | Avg. GC overhead (%) | Max GC overhead (%)
Min512Max2048-optthruput [baseline] | 85 | 245 | 0:10:33 | 0:03:21 | 6 | 90
Min=Max2048-optthruput | 29 | 79 | 0:10:22 | 0:03:10 | 3 | 87
Min=Max2048-optavgpause | 11 | 84 | 0:10:11 | 0:03:18 | 1 | 75
Min=Max2048-gencon | 20 | 280 | 0:10:00 | 0:03:10 | 2 | 100
Observations with Change settings
- The mean occupancy in the nursery is 3%. This is low, so the gencon policy is probably an optimal policy for this workload.
- Approximately 40% improvement in response time.
- Average GC overhead is reduced to 6% in online and 2% in batch.
- Average CPU usage drops by around 12 percentage points (from 62.9% to 50.9%).
Subpool (Ok but…)
Policy | Option | Description
Subpooling | -Xgcpolicy:subpool | Uses an algorithm similar to the default policy's but employs an allocation strategy that is more suitable for multiprocessor machines. We recommend this policy for SMP machines with 16 or more processors. This policy is only available on IBM pSeries® and zSeries® platforms. Applications that need to scale on large machines can benefit from this policy.
The subpool policy can help increase performance on multiprocessor systems. As I mentioned earlier,
this policy is available only on IBM pSeries and zSeries machines. The heap layout is the same as that for
the optthruput policy, but the structure of the free list is different. Rather than having one free list for
the entire heap, there are multiple lists, known as subpools. Each pool has an associated size by which
the pools are ordered. An allocation request of a certain size can quickly be satisfied by going to the pool
with that size. Atomic (platform-dependent) high-performing instructions are used to pop a free list
entry off the list, avoiding serialized access. The following figure shows how the free chunks of storage are organized by size:
[Figure: Subpool free chunks ordered by size]
When the JVM starts or when a compaction has occurred, the subpools are not used because there are
large areas of the heap free. In these situations, each processor gets its own dedicated mini-heap to
satisfy requests. When the first garbage collection occurs, the sweep phase starts populating the
subpools, and subsequent allocations mainly use subpools.
The subpool policy can reduce the time it takes to allocate objects. Atomic instructions ensure that
allocations happen without acquiring a global heap lock. Mini-heaps local to a processor increase
efficiency because cache interference is reduced. This has a direct effect on scalability, especially on
multiprocessor systems. On platforms where subpool is not available, generational GC can provide
similar benefits.
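The subpool idea (free chunks grouped by size, popped without a global heap lock) can be sketched with java.util.concurrent. This is a toy model under stated assumptions: ToySubpools is hypothetical, chunk "addresses" are plain longs, each pool holds chunks of exactly one size class, and ConcurrentLinkedDeque (a lock-free, CAS-based deque) stands in for the JVM's atomic free-list instructions.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Toy model of subpool-style allocation (hypothetical, not the IBM JVM's code).
class ToySubpools {

    private final int[] poolSizes;                        // ascending size classes in bytes
    private final ConcurrentLinkedDeque<Long>[] pools;    // free chunk addresses per size class

    @SuppressWarnings("unchecked")
    ToySubpools(int... poolSizes) {
        for (int i = 1; i < poolSizes.length; i++) {
            if (poolSizes[i] <= poolSizes[i - 1]) {
                throw new IllegalArgumentException("pool sizes must ascend");
            }
        }
        this.poolSizes = poolSizes.clone();
        this.pools = new ConcurrentLinkedDeque[poolSizes.length];
        for (int i = 0; i < pools.length; i++) {
            pools[i] = new ConcurrentLinkedDeque<>();
        }
    }

    // Called by the (toy) sweep phase: return a free chunk to the pool of its exact size.
    void free(long address, int size) {
        int i = smallestPoolFitting(size);
        if (i < 0 || poolSizes[i] != size) {
            throw new IllegalArgumentException("no size class " + size);
        }
        pools[i].push(address);
    }

    // Pop a chunk from the smallest size class that fits, falling back to larger classes.
    // Returns -1 when nothing fits (a real JVM would raise an allocation failure and run GC).
    long allocate(int size) {
        int start = smallestPoolFitting(size);
        if (start < 0) {
            return -1;
        }
        for (int i = start; i < pools.length; i++) {
            Long address = pools[i].pollFirst();           // lock-free removal, no global lock
            if (address != null) {
                return address;
            }
        }
        return -1;
    }

    private int smallestPoolFitting(int size) {
        for (int i = 0; i < poolSizes.length; i++) {
            if (poolSizes[i] >= size) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ToySubpools heap = new ToySubpools(32, 64, 128);
        heap.free(0x1000, 64);
        heap.free(0x2000, 128);
        System.out.println(Long.toHexString(heap.allocate(40)));  // served by the 64-byte pool
        System.out.println(Long.toHexString(heap.allocate(40)));  // 64-byte pool empty: 128-byte pool
        System.out.println(heap.allocate(40));                    // nothing left
    }
}
```

A real allocator would also split an oversized chunk and return the remainder to a smaller pool; that step is omitted to keep the sketch short.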
Results
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 517 | 1399 | 19 | 41 | 79.2 | 62.9 | Baseline
Min=Max2048-optthruput | 283 | 777 | 11 | 69 | 75.9 | 62.5 | 6.94
Min=Max2048-optavgpause | 513 | 513 | 0 | 13 | 94.4 | 85.7 | -593.36
Min=Max2048-gencon | 155 | 2206 | 6 | 100 | 76.2 | 50.9 | 39.20
Min=Max2048-subpool | 245 | 638 | 7 | 89 | 70.9 | 56.3 | 22.93
Batch
Test type | GC time (sec) | Collections | Accrual (h:mm:ss) | Posting (h:mm:ss) | Avg. GC overhead (%) | Max GC overhead (%)
Min512Max2048-optthruput [baseline] | 85 | 245 | 0:10:33 | 0:03:21 | 6 | 90
Min=Max2048-optthruput | 29 | 79 | 0:10:22 | 0:03:10 | 3 | 87
Min=Max2048-optavgpause | 11 | 84 | 0:10:11 | 0:03:18 | 1 | 75
Min=Max2048-gencon | 20 | 280 | 0:10:00 | 0:03:10 | 2 | 100
Min=Max2048-subpool | 24 | 66 | 0:09:36 | 0:03:10 | 3 | 94
Limitation for Subpool
- Although subpool can help performance on multiprocessor systems, it is available only on IBM pSeries and zSeries machines.
- Average GC overhead is higher (7% vs. 6% for gencon) because compaction happens during the test, increasing AF (allocation failure)/GC pause time.
- The average GC pause time is longer with subpool (375 ms) than with gencon (120 ms), because more memory has to be reclaimed per collection. [Look at GC pause time in the graph below.]
Our tests therefore show that gencon is the best of these GC policies for the BFUB application.
Now what!!!
Problem Statement 3
Tuning gencon policy
Reason
As discussed earlier, the nursery is where short-lived objects are stored; an object is moved to the tenured area if it survives a certain number of GCs. During our analysis we observed that BFUB creates more short-lived than long-lived objects, so tuning the nursery size should give us some gains.
Solution
Tuning nursery size (-Xmn<size>)
- Default: nursery 25% of the maximum heap size, the remaining 75% for the tenured area.
- Nursery 50% of the maximum heap size, the remaining 50% for the tenured area.
- Nursery 75% of the maximum heap size, the remaining 25% for the tenured area.
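For the 2048 MB heap used in the tests, the three settings translate into -Xmn values as sketched below. NurserySplit is a hypothetical helper, and the 25% row is only the explicit equivalent of the default, assuming the default is 25% as stated above.

```java
// Sketch: nursery fraction -> -Xmn value and remaining tenured size (all sizes in MB).
class NurserySplit {

    // Returns {nurseryMb, tenuredMb} for the given maximum heap and nursery fraction.
    static long[] split(long maxHeapMb, double nurseryFraction) {
        if (nurseryFraction <= 0 || nurseryFraction >= 1) {
            throw new IllegalArgumentException("fraction must be in (0, 1)");
        }
        long nurseryMb = Math.round(maxHeapMb * nurseryFraction);
        return new long[] {nurseryMb, maxHeapMb - nurseryMb};
    }

    static String xmnFlag(long nurseryMb) {
        return "-Xmn" + nurseryMb + "m";
    }

    public static void main(String[] args) {
        long maxHeapMb = 2048;                               // Min=Max2048 in the tests
        for (double f : new double[] {0.25, 0.50, 0.75}) {
            long[] s = split(maxHeapMb, f);
            System.out.println((int) (f * 100) + "% nursery: " + xmnFlag(s[0]) + ", tenured " + s[1] + " MB");
        }
    }
}
```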
Results
Online
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min=Max2048-gencon (default 25% nursery) | 155 | 2206 | 6 | 100 | 76.2 | 50.9 | 39.20
Min=Max2048-gencon (50% nursery) | 107 | 1082 | 3 | 100 | 65.6 | 50.6 | 46.18
Min=Max2048-gencon (75% nursery) | 92 | 724 | 2 | 100 | 64.6 | 49.9 | 25.33
Observations with Change settings
- Continuous nursery heap occupancy was observed with the 25% nursery size setting.
- The average GC pause time is longer with the 75% nursery size (137 ms) than with the 50% nursery size (120 ms), because more memory has to be reclaimed per collection.
- The mean occupancy of the tenured area is 87% with the 75% nursery size setting, which is high.
- The best result (46% improvement in response time) was observed with the 50% nursery size setting.
- Fewer global garbage collections were observed with the 50% nursery size (4) than with the 75% nursery size (9).
Problem Statement 4
Remove explicit GC
Reasons
The use of System.gc() is generally not recommended, since explicit calls can cause long pauses and do not allow the garbage collection algorithms to optimize themselves.
Solution
Use -Xdisableexplicitgc, so that any System.gc() call in BFUB code is ignored.
With the optimized settings, we saw a large improvement in CPU utilization and transaction response time in online mode, and a significant reduction in GC pause time. The elapsed time of the batch processes stayed more or less the same (their GC overhead was already low), but the optimized GC settings did reduce GC pause time.
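-Xdisableexplicitgc makes System.gc() a no-op, but it is still worth finding the calls in the code base. A rough sketch, assuming regex matching is good enough for a first pass: it also reports matches inside comments and strings, misses calls made through a saved Runtime reference, and reads files as UTF-8. ExplicitGcScanner is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: list explicit GC requests in Java sources for review.
class ExplicitGcScanner {

    // Matches System.gc() and Runtime.getRuntime().gc(), tolerating spaces such as "System.gc ()".
    static final Pattern EXPLICIT_GC = Pattern.compile(
            "\\bSystem\\s*\\.\\s*gc\\s*\\(\\s*\\)"
            + "|\\bgetRuntime\\s*\\(\\s*\\)\\s*\\.\\s*gc\\s*\\(\\s*\\)");

    // Returns "lineNumber: text" for every source line that requests a GC.
    static List<String> findCalls(List<String> sourceLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            if (EXPLICIT_GC.matcher(sourceLines.get(i)).find()) {
                hits.add((i + 1) + ": " + sourceLines.get(i).trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> javaFiles;
        try (Stream<Path> walk = Files.walk(root)) {
            javaFiles = walk.filter(p -> p.toString().endsWith(".java")).collect(Collectors.toList());
        }
        for (Path file : javaFiles) {
            for (String hit : findCalls(Files.readAllLines(file))) {
                System.out.println(file + ":" + hit);
            }
        }
    }
}
```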
Problem Statement 5
Why not test with full M&D (Teller with ATM, BPW, Lending, collateral and core) and see the impact?
Reasons
We should know the impact when all the modules are running, which is the real-world scenario in banks.
Solution
Test type | GC time (sec) | Collections | Avg. GC overhead (%) | Max GC overhead (%) | Max CPU (%) | Avg. CPU (%) | Improvement in response time (%)
Min512Max2048-optthruput [baseline] | 570 | 1676 | 23 | 95 | 85.6 | 67.2 | Baseline
Min=Max2048-gencon/Xmn1024 | 162 | 1431 | 3 | 100 | 71 | 53.3 | 46
Problem Statement 6
Why not run full EOD and see the impact?
It is reasonable to expect that these settings will also improve the overall EOD (end-of-day) timings.
Still Grey areas in Bankfusion UB
Finalizer
Using finalizers is not recommended, as they can slow garbage collection and cause wasted space in the heap. We should review the BFUB application for occurrences of the finalize() method. The IBM Support Assistant (ISA) tool add-on IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer can list objects that are retained only through finalizers.
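The usual replacement for a finalizer is explicit, deterministic cleanup. A minimal sketch, assuming the resource can be released at a well-defined point: ReportHandle is a hypothetical stand-in for a BFUB object that owns an external resource, and try-with-resources needs Java 7 or later (on the Java 6 JVM shipped with WAS 7, a try/finally block calling close() does the same job).

```java
// Sketch: release a resource deterministically instead of in finalize().
class FinalizerFree {

    static final class ReportHandle implements AutoCloseable {
        private boolean open = true;

        void write(String line) {
            if (!open) {
                throw new IllegalStateException("handle already closed");
            }
            // ... write the line to the underlying file/socket/cursor (omitted) ...
        }

        boolean isOpen() { return open; }

        // Idempotent; runs exactly where the resource stops being needed.
        // There is no finalize() override, so the GC never has to queue this object
        // for finalization or keep it alive for an extra collection cycle.
        @Override
        public void close() { open = false; }
    }

    // Uses a handle inside try-with-resources and returns it so callers can check its state.
    static ReportHandle useAndClose() {
        ReportHandle handle = new ReportHandle();
        try (ReportHandle h = handle) {
            h.write("interest accrual posted");
        }                                          // close() has run here, even if write() threw
        return handle;
    }

    public static void main(String[] args) {
        System.out.println("still open? " + useAndClose().isOpen());
    }
}
```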
LOA
LOA is the large object area, the part of the heap that the IBM GC reserves for allocating large objects. A large object is one that occupies more than 64 KB of heap. The more large objects there are, the longer the GC pauses, so large objects should be minimized or avoided.
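One way to avoid large objects is to hold a big payload as many small arrays instead of one huge byte[]. A sketch under stated assumptions: ChunkedBuffer is hypothetical, the 64 KB threshold is taken from the text above, the 32 KB chunk size is arbitrary, and array header overhead is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: store a large payload in fixed-size chunks so no single array exceeds the threshold.
class ChunkedBuffer {

    static final int LARGE_OBJECT_BYTES = 64 * 1024;   // threshold quoted in the text
    static final int CHUNK_SIZE = 32 * 1024;           // arbitrary, below the threshold

    private final List<byte[]> chunks = new ArrayList<>();
    private long size;

    void write(byte[] src, int off, int len) {
        if (off < 0 || len < 0 || off + len > src.length) {
            throw new IndexOutOfBoundsException("bad offset/length");
        }
        while (len > 0) {
            int used = (int) (size % CHUNK_SIZE);      // bytes already in the last chunk
            if (used == 0) {
                chunks.add(new byte[CHUNK_SIZE]);      // start a new small chunk
            }
            byte[] last = chunks.get(chunks.size() - 1);
            int n = Math.min(len, CHUNK_SIZE - used);
            System.arraycopy(src, off, last, used, n);
            off += n;
            len -= n;
            size += n;
        }
    }

    byte get(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException(Long.toString(index));
        }
        return chunks.get((int) (index / CHUNK_SIZE))[(int) (index % CHUNK_SIZE)];
    }

    long size() { return size; }

    int chunkCount() { return chunks.size(); }

    // True if a single byte[] of this length would exceed the threshold (header ignored).
    static boolean wouldBeLargeObject(int byteArrayLength) {
        return byteArrayLength > LARGE_OBJECT_BYTES;
    }

    public static void main(String[] args) {
        ChunkedBuffer buf = new ChunkedBuffer();
        byte[] piece = new byte[1000];                  // e.g. one streamed report line
        for (int k = 0; k < 200; k++) {
            buf.write(piece, 0, piece.length);
        }
        System.out.println(buf.size() + " bytes in " + buf.chunkCount() + " chunks; as one array it would be a large object: "
                + wouldBeLargeObject((int) buf.size()));
    }
}
```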
There are no dumb questions in BankfusionUB
Q1. How do GC settings change with the system and the transactions?
From the above discussion, it is clear that the product has to decide whether it needs response time improvement, throughput improvement, or a trade-off between the two. So the transaction requirements matter a lot for GC settings. The system also plays an important role, because GC is a CPU-intensive activity and the GC threads depend on the hardware configuration. Therefore GC settings should be chosen according to the system configuration.
Q2. Why not give the JVM the maximum heap the system has available?
No, it is a myth that we should give the JVM as much heap as possible. GC settings should always be recommended by testing iteratively and comparing the results. The problem with a large heap is that GC may need to clear a big pile of heap, which pauses the system for longer; and if the heap is fragmented, the GC threads may need longer to mark the objects.
Q3. Is there a magic formula we can give customers, so that BFUB can be tuned in the customer environment?
There is no magic formula, but a "magic" analysis can be done based on the system requirements and transaction peak, using some of our magic tools.
Conclusion/Best Practices
Hopefully the GC settings explained in this document will take care of everything!!
But GC settings still depend on the system and the transaction volume.
Thus, the final GC settings are:
Initial heap size = Maximum heap size (-Xms equal to -Xmx)
GC policy = gencon (-Xgcpolicy:gencon)
Nursery size = 50% of maximum heap size (-Xmn1024m for a 2048 MB maximum heap)
Disable explicit GC = -Xdisableexplicitgc
Print the GC parameters = -verbose:sizes
Maximum heap size will depend on the system configuration and throughput requirement.
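The final settings can be assembled for any maximum heap size as below. A minimal sketch: RecommendedGcArgs is a hypothetical helper, its output is meant for the Generic JVM arguments field, and the maximum heap itself still has to come from testing on the target system.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: build the recommended IBM JVM options for a given maximum heap (MB).
class RecommendedGcArgs {

    static List<String> forMaxHeap(int maxHeapMb) {
        if (maxHeapMb <= 0) {
            throw new IllegalArgumentException("maxHeapMb must be positive");
        }
        List<String> opts = new ArrayList<>();
        opts.add("-Xms" + maxHeapMb + "m");           // initial heap = maximum heap
        opts.add("-Xmx" + maxHeapMb + "m");
        opts.add("-Xgcpolicy:gencon");
        opts.add("-Xmn" + (maxHeapMb / 2) + "m");     // nursery = 50% of maximum heap
        opts.add("-Xdisableexplicitgc");
        opts.add("-verbose:sizes");                   // print the GC/heap parameters in use
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", forMaxHeap(2048)));
    }
}
```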
Sources
http://www.iecc.com/gclist/GC-faq.html
http://www.ibm.com/developerworks/ibm/library/i-gctroub/
http://www.ibm.com/developerworks/java/library/j-ibmjava2/
http://www.ibm.com/developerworks/java/library/j-ibmjava3/
http://www.performancewiki.com/was-tuning.html
http://publib.boulder.ibm.com/infocenter/ieduasst/v1r1m0/index.jsp?topic=/com.ibm.iea.was_v7/was/7.0/ProblemDetermination/WASv7_GCMVOverview/player.html
http://publib.boulder.ibm.com/infocenter/javasdk/v6r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.60/diag/appendixes/cmdline/cmdline_gc.html