Real Life Java EE Performance Tuning
Matt Brasier
Principal Consultant
C2B2 Consulting
About Me
Professional Services Consultant
Customers include
• Red Hat (JBoss)
• BEA
• Cape Clear
• Government/Finance/Telecoms
C2B2 Consulting
• SOA and Java EE consultancy
• Fast, Reliable, Manageable, Secure
What we will cover
Philosophy
• How I approach a performance problem situation
Enterprise Java Performance
• What kind of things affect performance of Enterprise Systems
Case Study 1
• A new version of the application runs slowly
Case Study 2
• Logging in takes a long time in the live environment
Case Study 3
• The application does not scale
What we will learn
Philosophy
• Suggestions to keep in mind when looking at a performance problem
Tools
• Suggested tools for looking at a performance problem
Techniques
• How to use the tools, knowledge and skills to solve your performance problem
Philosophy
‘A good understanding’ is the best performance tuning tool
Prefer common and open source tools
Observe, Hypothesize, Tweak, Test
‘Trust no-one’
Classic Java performance problems
Memory leaks
• Increased GC time
Poor GC or JVM memory configuration
CPU bound code
IO bound code
Memory bound code
• Increased GC time
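"Increased GC time" can be checked from inside the JVM. The following is a minimal sketch, assuming a JVM that exposes the standard java.lang.management beans; the class name GcOverhead is made up for illustration, and the figure is the accumulated collection time as reported by the collectors, which is only an approximation of pause time for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports how much of the JVM's uptime went to garbage collection.
final class GcOverhead {
    /** Total milliseconds of collection time since JVM start, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of elapsed time spent in GC, given GC millis and uptime millis. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.2f%%)%n",
                gc, uptime, 100.0 * gcFraction(gc, uptime));
    }
}
```

A fraction that keeps climbing between readings is one hint of a memory leak or an undersized heap.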
Enterprise Java Performance
CAVEAT: Consultancy Selection Bias
80/20: 80% of time finding, 20% fixing
Many ‘Enterprise’ Java performance problems turn out not to be ‘classic’ performance bottlenecks
• Infrastructure/Middleware performance
There are many factors that can affect the performance of an enterprise system
• Not just code
Enterprise Java Performance
Not all Java EE performance problems are classical ‘Java performance problems’
Common types of Java EE performance problem
• Resource starvation
• Threading problems
• ‘Suboptimal configuration’
• Network related problems
• Scalability problems
A Good Understanding
Consider the system as a whole
Know how infrastructure components work
• Not just what they do, but how they do it
How do the Java EE specifications say they should work?
Approach
Understand the system
Understand the environment
Understand the situation
Talk to people who know
• But trust no-one
Take a look for myself
Observe, Hypothesize, Tweak, Test
• Rinse and repeat
Case Study 1
Case Study 1
Existing customer calls
• “We deployed a new version of the application, and it is running a lot slower”
The Environment
• Sun Java 5
• WebLogic Server 9.2 Cluster (3 nodes)
• WebLogic Integration 9.2 Cluster (3 nodes)
• Documentum Document Management
• Oracle Database
• Solaris OS
Case Study 1
The System
• Web Application
• WLI based workflow system
The situation
• New version deployed into the performance testing environment
• Automated performance tests indicate the application is approximately 30% slower
Case Study 1
Observe
• No monitoring in place
• Some alerting, but no historical data
Hypothesize
• If we had more monitoring, we would stand a better chance
Tweak
• Put some monitoring in place
• Hyperic HQ from SpringSource
Case Study 1
Test
• Re-run tests
Observe
• Monitoring indicates that one server is slower
  • Handling fewer requests per second
  • Lots of transaction timeouts
  • Higher CPU
  • Less network traffic
Tweak
• Add more monitoring to the slow server
• Examine log files
• Thread dumps!
Case Study 1
Hypothesize
• Thread dumps show lots of threads in logging code waiting to write to the log file
• Log files for the slow server have DEBUG messages in them
  • The other servers don’t
“The logging configurations are identical, the servers are configured with Maven”
• Trust no one
Test
• Log in to the server and manually check the logging configuration
Case Study 1
Solution
• Debug logging was enabled on one server
• Turned debug logging off - the system was then about the same speed as the old release
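One way to "check for yourself" is to ask the running JVM which log level is actually in effect, rather than reading the configuration files. The sketch below uses java.util.logging only because it ships with the JDK (the slides do not name the logging framework used in this case study); FINE plays roughly the role of DEBUG, and the class and logger names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: reports the level a logger is really running at.
final class EffectiveLogLevel {
    /** Walks up the logger hierarchy until a logger with an explicit level is found. */
    static Level effectiveLevel(Logger logger) {
        for (Logger l = logger; l != null; l = l.getParent()) {
            if (l.getLevel() != null) {
                return l.getLevel();
            }
        }
        return Level.INFO; // fallback; the root logger normally has a level set
    }

    /** True if debug-style (FINE) output is enabled for this logger. */
    static boolean debugEnabled(Logger logger) {
        return logger.isLoggable(Level.FINE);
    }

    public static void main(String[] args) {
        // "com.example.workflow" is a made-up logger name.
        Logger log = Logger.getLogger("com.example.workflow");
        System.out.println("Effective level: " + effectiveLevel(log));
        System.out.println("Debug enabled:   " + debugEnabled(log));
    }
}
```

Running the same check on every node of a cluster and comparing the answers would have exposed the one server that differed.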
Hyperic HQ
Hyperic HQ
Monitoring tool
• Not a profiling tool
Historical data
• Trends
• Abnormal behaviour
• ‘Hot’ spots
Wide variety of data
• JVM level statistics
• JMX statistics
• OS statistics
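The kinds of JVM and OS readings listed above are available through the platform MXBeans. This is a rough sketch of what a single sample might contain, not how Hyperic HQ collects its data; the class name and map keys are invented for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sampler: one reading of a few JVM and OS level statistics.
final class JvmSnapshot {
    /** Collects heap, thread and OS readings from the platform MXBeans. */
    static Map<String, Number> sample() {
        Map<String, Number> s = new LinkedHashMap<>();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        s.put("heapUsedBytes", heap.getUsed());
        s.put("heapCommittedBytes", heap.getCommitted());
        s.put("threadCount", ManagementFactory.getThreadMXBean().getThreadCount());
        s.put("availableProcessors",
                ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors());
        // Load average is negative on platforms that do not provide it.
        s.put("systemLoadAverage",
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        return s;
    }

    public static void main(String[] args) throws InterruptedException {
        // Three samples one second apart; a real monitor would store these to show trends.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.currentTimeMillis() + " " + sample());
            Thread.sleep(1000);
        }
    }
}
```

The value of a monitoring tool comes from keeping such samples over time, so that one server's numbers can be compared with its own history and with its peers.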
Thread Dumps
My Number 2 tool for finding performance problems
• CTRL-BREAK on Windows
• kill -3 on Unix/Linux
• jstack tool
• Available from consoles of many application servers
All threads in the VM and what they are doing at that moment
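When none of those options is to hand, a similar listing can be produced from inside the JVM. A minimal sketch, assuming the code can be run in the target VM; the class name is hypothetical and the output format only loosely resembles a real jstack dump (it omits lock information, for example).

```java
import java.util.Map;

// Hypothetical helper: prints every live thread with its state and stack.
final class ThreadDumper {
    /** Formats the stack of every live thread, roughly like a thread dump. */
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=")
               .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```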
Thread Dumps
A number of thread dumps over time gives a good picture
• Any operation that appears a lot is a suspect
• Understand what ‘normal’ thread dumps look like
http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
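The idea that "any operation that appears a lot is a suspect" can be approximated by taking several samples and counting how often each method sits at the top of a stack. This is an illustrative sketch with made-up names, a poor man's sampler rather than a replacement for reading the dumps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sampler: counts the top stack frame of every thread across several samples.
final class HotFrameSampler {
    /** Adds the top stack frame of every live thread to the running tally. */
    static void sampleInto(Map<String, Integer> tally) {
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                tally.merge(top, 1, Integer::sum);
            }
        }
    }

    /** Takes several samples with a pause between them and returns the tally. */
    static Map<String, Integer> sample(int samples, long pauseMillis) {
        Map<String, Integer> tally = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            sampleInto(tally);
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return tally;
    }

    public static void main(String[] args) {
        // Print the frames seen most often first.
        sample(5, 200).entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

On an idle server the top entries are usually waits inside the thread pool; knowing that baseline is what makes an unusual entry, such as logging code, stand out.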
Thread Dump
Thread Dumps
Look near the top of each stack
Look for stacks with your code in them
Look for long stacks
Look for deadlocks and other threading issues
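Deadlocks in particular can be detected programmatically. The sketch below uses the standard ThreadMXBean, assuming the JVM supports deadlock detection (HotSpot does); the class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: asks the JVM whether any threads are deadlocked.
final class DeadlockCheck {
    /** Returns a description of deadlocked threads, or an empty string if none are found. */
    static String describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info != null) { // a thread may have exited since it was detected
                out.append(info.getThreadName())
                   .append(" waiting on ").append(info.getLockName())
                   .append(" held by ").append(info.getLockOwnerName()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "No deadlocks detected" : report);
    }
}
```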
The Understanding
What does a normal WebLogic thread dump look like?
It is not normal to see logging code frequently in a thread dump
Lots of threads all waiting on a single lock object is a Bad Thing™
If three servers are supposed to do the same thing, their thread dumps should look similar
• Over time
Lessons
Thread dumps hold a lot of information
Infrastructure configuration faults are more common than infrastructure bugs
Automated/continuous build and deploy solutions are no silver bullet
• Check the results yourself
Believe your ‘instincts’
Case Study 2
Case Study 2
Customer Call
• “We deployed our application into the live environment and it takes several minutes for users to log in”
Environment
• Apache web servers
• WebLogic Portal 8.1 Cluster (2 nodes)
• Oracle Database
• Windows Server 2003
• Bespoke Single Sign On server
Case Study 2
The System
• Web application based on WSRP portlets
• Oracle database storing user data
The Situation
• The first users to log in in the morning find that it takes several minutes
• After the first few log-ins, the application runs fine
Case Study 2
Hypothesize
• The bespoke Single Sign On server makes me suspicious
  • Bespoke code is tested less
Test
• Turn on debug logging for the SSO implementation
• Observe timings of log messages
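Observing the timings of log messages amounts to looking for the largest gap between consecutive timestamps. A small sketch of that idea follows; the log format (a leading HH:mm:ss.SSS timestamp) and the sample lines are invented for illustration and are not taken from the customer's logs, and the code ignores logs that cross midnight.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds where a log goes quiet for the longest time.
final class LogGaps {
    /** Parses the leading HH:mm:ss.SSS timestamp of a log line (assumed format). */
    static LocalTime timestampOf(String line) {
        return LocalTime.parse(line.substring(0, 12));
    }

    /** Returns the index of the line that follows the longest pause, or -1 if no pause is found. */
    static int indexAfterLongestGap(List<String> lines) {
        int worst = -1;
        Duration longest = Duration.ZERO;
        for (int i = 1; i < lines.size(); i++) {
            Duration gap = Duration.between(timestampOf(lines.get(i - 1)), timestampOf(lines.get(i)));
            if (gap.compareTo(longest) > 0) {
                longest = gap;
                worst = i;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed format.
        List<String> lines = Arrays.asList(
                "08:59:58.120 DEBUG SSO token received",
                "08:59:58.310 DEBUG SSO token validated",
                "08:59:58.325 DEBUG Loading user profile",
                "09:03:12.902 DEBUG User profile loaded");
        int i = indexAfterLongestGap(lines);
        System.out.println("Longest pause ends at: " + lines.get(i));
    }
}
```

The line before the longest pause names the step that was slow, which is how the suspicion can move away from the SSO code and towards whatever came next.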
Case Study 2
Observe
• The logs indicate that the SSO log-in is proceeding as expected
• It appears that loading the user’s profile data from the database is taking a long time
Hypothesize
• TCP timeouts when connecting to the database due to a firewall
Case Study 2
Test
• Observe the connection pool statistics in the WebLogic console
• The console indicates that a large number of connections have been opened during the time the application has been running
  • Connections are not normally closed and re-opened
• See how long you need to leave the system before the problem occurs
Case Study 2
Solution
• Discussions with the networking team indicated that there was a firewall, configured to silently terminate network connections that were idle for 60 minutes
• Set WebLogic to test connections after they have been idle for 50 minutes
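The rule behind the fix can be sketched in a few lines: test any pooled connection that has been idle for longer than a threshold chosen to be shorter than the firewall's timeout. This illustrates the idea only and is not WebLogic's implementation or configuration; the class name and the 5 second validation timeout are assumptions.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.TimeUnit;

// Hypothetical policy: decide when a pooled connection must be tested before reuse.
final class IdleConnectionPolicy {
    private final long testAfterIdleMillis;

    IdleConnectionPolicy(long testAfterIdle, TimeUnit unit) {
        this.testAfterIdleMillis = unit.toMillis(testAfterIdle);
    }

    /** True if a connection last used at lastUsedMillis should be tested before reuse. */
    boolean shouldTest(long lastUsedMillis, long nowMillis) {
        return nowMillis - lastUsedMillis >= testAfterIdleMillis;
    }

    /** Returns true if the connection can be handed out, testing it first when idle too long. */
    boolean usable(Connection c, long lastUsedMillis, long nowMillis) throws SQLException {
        if (!shouldTest(lastUsedMillis, nowMillis)) {
            return true;
        }
        return c.isValid(5); // 5 second validation timeout, an arbitrary choice
    }

    public static void main(String[] args) {
        // 50 minutes: comfortably below the firewall's 60 minute idle timeout.
        IdleConnectionPolicy policy = new IdleConnectionPolicy(50, TimeUnit.MINUTES);
        long now = System.currentTimeMillis();
        long idle55 = now - TimeUnit.MINUTES.toMillis(55);
        System.out.println("Idle 55 min, test first? " + policy.shouldTest(idle55, now));
    }
}
```

Testing at 50 minutes also generates traffic on the connection, which resets the firewall's idle timer before it reaches 60 minutes.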
Lessons
Consider the system as a whole
• Hardware
• Networking
• OS
• Middleware
• Application
The Understanding
Firewalls are often configured to silently terminate idle TCP connections
The TCP protocol requires that a connection is closed by both sides, or times out
• The time out is several minutes
In a healthy WebLogic connection pool, the number of connections opened since the server started = the maximum number in the pool
Case Study 3
Case Study 3
Customer call
• “It takes about 20 seconds to render a page, and the performance does not scale”
Environment
• WebLogic Portal 9.1 Cluster (2 nodes)
• Oracle 10g Database
• Red Hat Enterprise Linux
Case Study 3
The System
• Online content delivery system
• WebLogic Portal with a commercial set of portlets
The Situation
• Two problems
  • Running the performance tests with 20 threads in JMeter is twice as slow as running the tests with 10 threads
  • Viewing a content item takes around 20 seconds
Case Study 3
Handle the two problems separately
• They may be related, they may not be
Case Study 3
Observe
• Viewing a content item takes around 16 seconds on my laptop
Test
• Is the rendering speed dependent on the browser used?
• Is the rendering speed dependent on the client machine?
• What does the page source look like?
Case Study 3
Observe
• In Opera the page renders quickly except for the table of contents on the left
• In Firefox, the whole page renders at the same time
• The page renders faster in IE and Opera than in Firefox
• The page renders faster on faster machines
• There is a lot of JavaScript, and AJAX is used to load the table of contents
Case Study 3
Hypothesize
• The AJAX rendering of the TOC is taking a long time, and slowing down the whole page load
Tweak
• Remove the TOC from the page
• Disable JavaScript in the browser
Test
• The page renders in less than 2 seconds
Case Study 3
Hypothesize
• JMeter does not execute the JavaScript, so the poor performance seen in the JMeter tests is not related to the slow page load in the browser
Case Study 3
Solution 1
• The portlet developers have used AJAX to render the table of contents for a content item; this is much slower than constructing the table of contents on the server side
• Rewrite the portlet to construct the table of contents on the server side
• Developers sometimes select a technology to enhance their CVs, not to implement a business requirement
Case Study 3
Problem 2 – Scalability
Observe
• Running the tests on JMeter with 10 users, each page response takes 5s
• Running the test with 20 users, each page response takes 12s
• JMeter is being run on an old laptop, which is at 100% CPU in both cases
Case Study 3
Hypothesize
• As the test machine is at 100% CPU, it is the performance of JMeter that is being measured, not the performance of WebLogic
Observe
• WebLogic is running at around 2% CPU usage, with many idle threads
Case Study 3
Tweak
• Run the test from a number of more modern machines, and make sure each one does not exceed 70% CPU
Observe
• Four machines can each run 20 threads and get responses in 1.5 seconds, and WebLogic is still running at around 5% CPU and not struggling
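The figures from the two test set-ups can be turned into throughput to see the difference more clearly. This is a back-of-the-envelope sketch that assumes each JMeter thread sends its next request as soon as the previous response arrives (no think time, which the slides do not state), so throughput is simply users divided by response time.

```java
// Hypothetical calculator: throughput implied by user count and response time.
final class Throughput {
    /** Requests per second for closed-loop users with no think time: users / response time. */
    static double requestsPerSecond(int users, double responseSeconds) {
        return users / responseSeconds;
    }

    public static void main(String[] args) {
        // Old laptop as the load generator.
        System.out.printf("10 users at 5 s    -> %.2f req/s%n", requestsPerSecond(10, 5.0));
        System.out.printf("20 users at 12 s   -> %.2f req/s%n", requestsPerSecond(20, 12.0));
        // Four modern machines, 20 threads each.
        System.out.printf("80 users at 1.5 s  -> %.2f req/s%n", requestsPerSecond(80, 1.5));
    }
}
```

Under that assumption the laptop delivered roughly 2 requests per second with 10 users and slightly less with 20, a sign that something was already saturated, while the four-machine run delivered over 50 requests per second against the same servers.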
Case Study 3
Solution
• The problem was that the test client was not able to generate the loads requested, resulting in the performance of the test client being measured
• Use a larger test client
Useful tools
Ethereal/Wireshark
• Network traffic sniffer
• See when requests/responses were sent/received
Firebug + YSlow
• Firefox plugin for performance analysis
Lessons
Separate problems should initially be prioritised and investigated separately
• Keep in mind that they may be related
Ensure the test system can generate the required load
• It should have plenty of free resources available
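A quick sanity check on a load-generating machine might look like the sketch below. It uses the system load average per CPU as a rough proxy for CPU utilisation, which is an assumption (the two are related but not identical, and the load average is not available on every platform); the 0.7 threshold simply echoes the 70% guideline used in the case study.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical check: is this machine too busy to be a trustworthy load generator?
final class LoadGeneratorCheck {
    /** Load average divided by CPU count; values near or above 1.0 suggest saturation. */
    static double loadPerCpu(double loadAverage, int cpus) {
        return loadAverage / cpus;
    }

    /** True if the load per CPU exceeds the threshold; false if the load average is unavailable. */
    static boolean tooBusy(double loadAverage, int cpus, double threshold) {
        return loadAverage >= 0 && loadPerCpu(loadAverage, cpus) > threshold;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative if the platform cannot report it
        int cpus = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("Load average not available on this platform");
        } else {
            System.out.println("Load per CPU: " + loadPerCpu(load, cpus)
                    + (tooBusy(load, cpus, 0.7) ? " (too busy to trust the results)" : " (ok)"));
        }
    }
}
```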
Lessons
The consultant effect
• Take a step back
• Get a fresh perspective
The Understanding
A slow test client will give slow results
Client side rendering is usually less efficient than server side
WebLogic is normally fast!
What did we learn?
Simple tools can provide a lot of information
Understanding how the system should behave will help highlight possible causes
Experience is vital
• Write a log of what you find
Take a step back from the problem
• Use a second pair of eyes
What did we learn?
Philosophy
• Understand the system as a whole
• A deep understanding of how it should work
Tools
• Thread dumps
• Monitoring tools
• Packet sniffing
Techniques
• Observe, Hypothesize, Tweak, Test
Questions
Session Evaluation
Please complete a session evaluation and turn it in to any conference staff member or at the registration desk. Thank you.