Monitoring Java On The Mainframe - GSE...
Monitoring Java On The Mainframe
Speaker name: Dave Swift
Speaker company: IBM
Date of presentation: 01/11/2016
Session: OA
Purpose of this Presentation
• This deck is intended to provide a brief overview of the IBM OMEGAMON for JVM product
• Sections within this presentation are:
– Why monitor JVMs on z/OS
– Introducing IBM OMEGAMON for JVM
– Resource monitoring of z/OS Connect Enterprise Edition
– Example scenarios
– Further Information
• Items flagged NEW are new in OMEGAMON for JVM V5.4.0
Why Monitor JVMs?
• Clear understanding of how many and what JVMs are running within an LPAR
• Need to address tuning issues in operations environment
• Highlight degradation of performance over time
– Leaks leading to inefficient or excessive Garbage Collections
– Diagnose potential OutOfMemory conditions
– Heap issues and native memory leaks
• Contention for shared resources
– Poor performance caused by multiple threads waiting on resources
• Sub-optimal CPU utilisation
– Is work running on general CPU or specialty engines?
• Operating within the correct environment
– Are there subsystems/applications running with insecure JVM levels or settings?
– Are the correct application versions deployed?
Introducing OMEGAMON for JVM on z/OS
• Brand new OMEGAMON monitoring agent focused on helping z/OS system administrators, operators and SMEs identify problems, resolve them more quickly and optimize performance
• Lightweight overhead compared to other offerings
– 90% of data collected is through the Health Center API
• Ability to view all JVMs side-by-side; no disconnect when switching between JVMs
• Collects data on any online JVM on z/OS
– Subsystems: CICS, IMS, DB2, WAS, z/OS Connect, ODM
– Standalone batch and USS Java applications
– Can identify and distinguish Liberty JVM servers
• Data presented on both the OMEGAMON enhanced 3270 UI and the Tivoli Enterprise Portal
• Reports on Garbage Collection, Active Threads, Lock Utilization, JVM Environment, CPU Utilization, Native Memory
• Detailed report on z/OS Connect EE resources for hybrid cloud monitoring
• Provides the standard OMEGAMON features:
– Look back in time with historical data collection
– Be alerted to abnormal conditions through defined event generation (Situations)
– Easy to configure and deploy using PARMGEN
[Diagram: z/OS Connect, Liberty, Batch/USS, CICS TS, WAS on z/OS, DB2 on z/OS, IMS and ODM on z/OS JVMs, all monitored by the OMEGAMON JVM Agent]
z/OS Connect EE Resource Monitoring
[Diagram: OMEGAMON for JVM monitoring a z/OS Connect Enterprise Edition instance fronting DB2, IMS, CICS and WAS]
Identify Service/API performance issues within z/OS Connect EE instances faster and avoid bottlenecks.
Data Provided:
• Highest JVM Statistics
• z/OS Connect EE Request Metrics: Average Response Time, Slowest Services
• JVM Environment: JVM Command Line, System Variables, Env Variables, JVM Parameters, Classpath, Boot Classpath
• Lock Details: GET Count, Average Hold Time, Slow Gets, Recursive Acquires, Lock Utilization %
• Thread Details: Thread State, Contending Object, Stack Trace
• Garbage Collection Statistics: Nursery GC Details, Global GC Details, % Time Paused, Heap Allocation
• CPU Statistics: General CPU, Specialty Processor (IFA) CPU, Specialty Processor Work on General CPU
• Native Memory: LE Heap Details, z/OS Extended Region Detail, Java Native Memory
Scenario: Visibility of all JVMs
"How much Java are we running? We need to see all JVMs that are currently online."
• JVMs can be found all over the environment. Can you be clear about what is online? Are there JVMs online that are unplanned?
• Starting the JVM Monitor will seek out and find all JVMs on an LPAR, regardless of subsystem type, whether they have been configured for full monitoring or not.
• The agent will capture the jobname, ASID, subsystem type and basic details of the JVM.
[Diagram: CICS TS, WAS, DB2 and IMS JVMs online in an LPAR, detected by the OMEGAMON JVM Agent]
Scenario: Visibility of all JVMs
For a JVM to be fully monitored, it must be instrumented to allow OMEGAMON to collect data. If not, we can still determine online JVMs and their subsystem type; these are reported on the second subpanel here. A user can then determine if they want to instrument that JVM for full monitoring.
Scenario: Visibility of all JVMs
Equivalent Tivoli Enterprise Portal screen showing JVMs currently being fully monitored and those detected as being online but not monitored by the JVM agent.
Scenario: Visibility of all JVMs
• To enable full monitoring of a JVM, it must be instrumented to allow the OMEGAMON agent to interact with the JVM and issue requests via the Health Center API.
• Typical configuration is a minor change to the JVM startup parameters (see the example launch below):
-Xhealthcenter:level=inprocess
-javaagent:/omegamon/uss/install/dir/kan/bin/IBM/kjj.jar
• OMEGAMON code will collect JVM environment information, capture JVM events (for example GCs) and push the details to the OMEGAMON JVM agent.
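For illustration, the same options added to a standalone Java application launched from USS might look like this; a minimal sketch, where com.example.MyApp is a hypothetical application class and the agent path is the placeholder path from the parameters above:

    java -Xhealthcenter:level=inprocess \
         -javaagent:/omegamon/uss/install/dir/kan/bin/IBM/kjj.jar \
         com.example.MyApp

For subsystem JVMs, such as a CICS JVM server, the equivalent options would typically be added to the relevant JVM profile or startup configuration instead.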
[Diagram: instrumented CICS TS, WAS, DB2 and IMS JVMs in an LPAR pushing data to the OMEGAMON JVM Agent]
Scenario: Optimizing Garbage Collection
"Performance of the JVM is poor. What can be causing this? Could Garbage Collection be a cause?"
• Performance of the Garbage Collector has improved significantly in recent releases of Java; however, poor performance can still occur due to:
– Insufficient heap allocation
– Poorly written applications
• The symptoms of such problems might be:
– An excessive number of GC events occurring within a given period of time
– High heap occupancy even after a GC
– Long pause times while a GC event is occurring
– System GC events occurring
• The Garbage Collection Details workspaces provide insight into the performance of the JVM GC, allowing the operator to confirm (or dismiss) the JVM as a bottleneck in performance throughput (see the sketch of these metrics below).
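For orientation on what these metrics are, standard Java exposes per-collector GC counts and cumulative pause times through JMX. OMEGAMON itself gathers its data via the Health Center API, so this is only a minimal sketch of the kind of counters behind metrics such as GC Rate per Minute and % Time Paused:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcSnapshot {
        public static void main(String[] args) {
            // One MXBean per collector, e.g. nursery (scavenge) and global collections.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Cumulative totals since JVM start; sample periodically and take
                // differences to derive a GC rate per minute or % time paused.
                System.out.printf("%s: %d collections, %d ms total pause%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }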
Scenario: Optimizing Garbage Collection
The Highest JVM Statistics subpanel shows the poorest performing statistics in key GC metrics. If a threshold is exceeded (for example, GC Rate per Minute), zoom into the JVM that potentially has an issue.
Scenario: Optimizing Garbage Collection
GC Details can point out key values that may indicate a problem. A rolling 5-minute interval is used to scale values. Does the Occupancy look OK? Is the Average Heap size fine?
Scenario: Out of Memory Conditions
"The address space periodically abends. Can we see what caused the issue?"
• The java.lang.OutOfMemoryError is a severe condition which often occurs with little warning and usually brings down the JVM. The error occurs when the system runs out of memory, either Java heap space or native memory.
• OMEGAMON for JVM constantly monitors the proportion of the maximum Java heap size that is still allocated after garbage collection. If that value exceeds 80%, a situation is triggered which can take actions such as alerting operations staff or application SMEs. If the condition escalates, the application can be restarted in an orderly fashion before it crashes or impacts end users (a sketch of this check follows below).
• The Native Memory analysis provides details which can help identify constraint in native memory: either the address space is over-committed or an application has a memory leak. By analyzing metrics such as Language Environment Heap utilization and Extended Region Free %, OMEGAMON can avert major outages due to native memory.
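The "allocated after garbage collection" figure can be approximated from standard JMX memory pools; a minimal sketch, not the product's implementation, with the 80% threshold taken from the slide:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import java.lang.management.MemoryUsage;

    public class PostGcOccupancy {
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() != MemoryType.HEAP) continue;
                // Heap usage measured immediately after the most recent GC of this
                // pool; may be null if the pool does not support this metric.
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (afterGc == null || afterGc.getMax() <= 0) continue;
                double pct = 100.0 * afterGc.getUsed() / afterGc.getMax();
                System.out.printf("%s: %.1f%% of max heap still allocated after GC%n",
                        pool.getName(), pct);
                if (pct > 80.0) {
                    // The level at which the slide says a situation would trigger.
                    System.out.println("  -> above 80%, raise an alert");
                }
            }
        }
    }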
Scenario: Out Of Memory Condition
Select a Job Name using the action menu and select option 'N' for 'Native Memory'
Scenario: Out Of Memory Condition
If the Extended Region Free % falls below 10 and continues to fall, it is an indication of a native memory leak. If the value falls below 5, then the JVM may need to be shut down and restarted.
Scenario: Identify Possible Memory Leak
"Can we be alerted to memory issues before they cause an abend?"
A snapshot of data is taken at regular intervals to allow viewing of system status at a specified point in the past.
Scenario: Identify Possible Memory Leak
A creeping rise in the heap occupancy after a GC has been performed is a sign of a possible memory leak. Left unaddressed, it could lead to an OutOfMemoryError, a JVM abend and a core dump (a contrived example of this pattern follows).
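For illustration only, a contrived Java fragment showing the classic pattern behind that symptom: a long-lived collection that grows on every request and is never pruned, so each GC reclaims less and post-GC occupancy creeps upward (the class and sizes are invented):

    import java.util.ArrayList;
    import java.util.List;

    public class LeakyCache {
        // Static, long-lived reference: everything added here survives every GC.
        private static final List<byte[]> CACHE = new ArrayList<>();

        static void handleRequest() {
            // Added on every request, never removed: post-GC heap occupancy
            // rises steadily until an OutOfMemoryError occurs.
            CACHE.add(new byte[64 * 1024]);
        }

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                handleRequest();
                Thread.sleep(10); // creeping growth rather than a sudden spike
            }
        }
    }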
Scenario: Identifying Locks and Thread Blocks
"Our applications are performing poorly. Can we see what might be the cause?"
• If it is not a GC issue, perhaps threads are being blocked for an excessive period of time, or locks within the JVM are being held for long periods, causing applications to wait for the monitor to yield.
• If high values are found here, the application owner (if applicable) can be alerted, or adjustments to the JVM environment could be made (see the blocked-thread sketch below).
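As a rough analogue of what the Thread Statistics panel surfaces, standard JMX can list BLOCKED threads together with the monitor they are waiting on; a minimal sketch, not how the agent actually collects its data:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class BlockedThreads {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // Dump all live threads, including locked monitor/synchronizer details.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                if (info.getThreadState() == Thread.State.BLOCKED) {
                    // The contended monitor and the thread currently holding it.
                    System.out.printf("%s is BLOCKED on %s held by %s%n",
                            info.getThreadName(), info.getLockName(),
                            info.getLockOwnerName());
                }
            }
        }
    }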
Scenario: Identifying Locks and Thread Blocks
Thread Statistics drills down to all active threads, making BLOCKED threads easy to spot.
NEW in V5.4.0 – Also shows Thread CPU to spot loops!
Scenario: Identifying Locks and Thread Blocks
The Lock Statistics show which monitor objects were used as locks most often and how long they were held for.
Scenario: Identify Environment Issues
"We need to ensure the Java levels being used are up to date."
• We are able to deep-dive into JVM environment details to view information like the classpath, system properties and the version of Java being used (the sketch below shows the standard properties involved).
• We can also define a situation to check a setting and alert us to a problem, in this case if a 'bad' Java version is being used.
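The version and classpath details shown in these workspaces correspond to standard Java system properties; a minimal sketch of reading them directly (the boot classpath property applies to Java 8 levels and may be absent on later ones):

    public class JvmEnvironment {
        public static void main(String[] args) {
            // Standard system properties describing the JVM environment.
            System.out.println("Java version:   " + System.getProperty("java.version"));
            System.out.println("Java vendor:    " + System.getProperty("java.vendor"));
            System.out.println("Classpath:      " + System.getProperty("java.class.path"));
            System.out.println("Boot classpath: " + System.getProperty("sun.boot.class.path"));
        }
    }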
Scenario: Identify Environment Issues
In the TEP Situation Editor we create a new Situation to check against the JVM's Version attribute. If this condition is ever met, a Warning alert will be raised.
Scenario: Identify Environment Issues
Once the situation is tripped, you can analyze the current conditions, identify the offending job and take appropriate action.
Scenario: Identify Environment Issues
The Situation Status Tree in the enhanced 3270 UI will show if there is a JVM online with the offending Java level. A user could then take appropriate action.
Scenario: Slow API Response Time
"Reports are coming back that application request response time into z/OS is poor. Can we identify affected services?"
• It's important to be alerted to poor response times for the services/APIs you are making available to consumers, potentially externally, to satisfy application performance and manage varying workloads before application owners raise complaints.
• The z/OS Connect Summary workspace displays all the z/OS Connect Services executed in the last 5 minutes. This workspace helps identify slow requests and, in conjunction with the Garbage Collection or Threads workspaces, specific causes of the symptom can be found (a sketch of the underlying aggregation follows).
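The summary amounts to simple aggregation over recent requests; a minimal sketch of computing an average response time per service, where the RequestRecord type and the sample data are invented for illustration:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ServiceSummary {
        // Hypothetical request record: service name plus elapsed time in ms.
        static class RequestRecord {
            final String service;
            final long responseMs;
            RequestRecord(String service, long responseMs) {
                this.service = service;
                this.responseMs = responseMs;
            }
        }

        public static void main(String[] args) {
            List<RequestRecord> lastFiveMinutes = Arrays.asList(
                    new RequestRecord("accountInquiry", 45),
                    new RequestRecord("accountInquiry", 1200),
                    new RequestRecord("paymentPost", 80));

            // Average response time per service, as the Summary workspace presents
            // it; sorting these averages highlights the slowest services.
            Map<String, Double> avgByService = lastFiveMinutes.stream()
                    .collect(Collectors.groupingBy(r -> r.service,
                            Collectors.averagingLong(r -> r.responseMs)));

            avgByService.forEach((svc, ms) ->
                    System.out.printf("%s: avg %.1f ms%n", svc, ms));
        }
    }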
Scenario: Slow API Response Time
Identify the z/OS Connect job by looking at the Application field. Select the job using option 'Z'.
Sort the rows by 'Avg Response Time', then identify and select the service name with the highest Avg Response Time. Selecting option 'S' will display more detailed information about a particular request.
Scenario: Slowest z/OS Connect Services
"Response time of services through a z/OS Connect EE instance is slow. Can we investigate requests to deduce the issue?"
• There may be certain properties that will point us to a problem around the reported service performance issue. Maybe there is something specific about the slowest requests, the client connected, or the payload being submitted.
• The z/OS Connect Slowest Requests workspace displays the five worst performing requests over the last 5 minutes for a particular z/OS Connect service. It can be used to provide diagnostic information about a specific request, which can help determine why a particular request performed poorly.
Scenario: Slowest z/OS Connect Services
Identify the desired Service Name you want more details for and select it with option 'S'
Scenario: Slowest z/OS Connect Services
Here you can view the five slowest requests: their request ID, method, response time, etc.
Requests with a greater length may take longer, as they are sent as JSON, which might have overhead depending on the subsystem being called.
In addition, if the Response Length is 0, there is no JSON response; the request may have encountered an error, which could also cause a slow response time.
The OMEGAMON Portfolio
• Service Management Suite on z/OS: Service Management Unite, NetView for z/OS, System Automation for z/OS, Tivoli Asset Discovery, plus the OMEGAMON Performance Management Suite on z/OS
• OMEGAMON Performance Management Suite on z/OS: OMEGAMON for CICS, OMEGAMON for IMS, OMEGAMON for DB2 PE, OMEGAMON for Messaging, OMEGAMON Dashboard Edition, ITCAM for Application Diagnostics, plus the OMEGAMON z/OS Management Suite
• OMEGAMON z/OS Management Suite: OMEGAMON on z/OS, OMEGAMON for JVM, OMEGAMON Mainframe Networks, OMEGAMON for Storage
More Information/References
• OMEGAMON Product Home
– Overview and product information for all OMEGAMON products
– www.ibm.com/OMEGAMON
• Service Management Connect
– Blogs, forums, articles and best-practice videos for IBM z Systems monitoring
– www.ibm.com/developerworks/servicemanagement/z
– Examples:
– Introducing OMEGAMON Monitoring for JVM
– Using OMEGAMON to Diagnose Slow JVMs Through Thread Data
– OMEGAMON JVM monitoring for z/OS Locking Data
• OMEGAMON Monitoring for JVM Technote
– Summary of latest fixes, known issues and updates
– www.ibm.biz/OMEGJVMTechnote
Contacts
• Offering Management
– Nathan Brice [email protected]
– Chris Walker [email protected]
• Release Management
– Jeff Summers [email protected]
– Dan Kitay [email protected]
• Marketing Enablement
– John Knutson [email protected]
• Sales Enablement
– Giulio Peri [email protected]
– Diego Bessone [email protected]
Video Overview
• Short 4-minute overview introducing the OMEGAMON Monitoring Feature for JVM
– YouTube direct link: https://youtu.be/QcqnD_B3xsg
– Service Management Connect blog with video embedded: www.ibm.biz/OMEGJVMVideoBlog
Session feedback
"This is the last slide in the deck."
• Please submit your feedback at http://conferences.gse.org.uk/2016/feedback/nn
• Session is nn